In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

import os
import sys

# Append the parent directory to sys.path so that utils can be found
sys.path.append(os.path.join(sys.path[0], os.path.pardir))

from utils import StringHandler as SH
from utils import NormalDistribution as ND
from utils import CorrelationCoefficient as CC

### **6.1/ AN INTUITIVE APPROACH**

#### **Positive Relationship:**
- Definition:
    + Positive Relationship: occurs insofar as pairs of scores tend to occupy similar relative positions (high with high, low with low) in their respective distributions.

#### **Negative Relationship:**
- Definition:
    + Negative Relationship: occurs insofar as pairs of scores tend to occupy dissimilar relative positions (high with low and vice versa) in their respective distributions.

#### **Little or No Relationship:**
- Unlike positive and negative relationships, which show a pronounced pattern (high scores paired with high scores, or high scores paired with low scores), little or no relationship occurs insofar as pairs of scores show no particular pattern: high scores are about as likely to be paired with low scores as with high scores.

#### **Review:**

#### **Progress Check 6.1: Indicate whether the following statements suggest a positive or negative relationship**

(a) More densely populated areas have higher crime rates. <br>
Positive relationship. <br>
(b) Schoolchildren who often watch TV perform more poorly on academic achievement tests. <br>
Negative relationship. <br>
(c) Heavier automobiles yield poorer gas mileage. <br>
Negative relationship. <br>
(d) Better-educated people have higher incomes. <br>
Positive relationship. <br>
(e) More anxious people voluntarily spend more time performing a simple repetitive task. <br>
Positive relationship. <br>

### **6.2/ SCATTERPLOTS**

- Definition:
    + Scatterplot: a graph containing a cluster of dots that represents all pairs of scores.

#### **Construction:**

#### **Positive, Negative, or Little or No Relationship:**
- Patterns of relationships:
    + Positive: when the dot cluster has a slope from the lower left to the upper right.
    + Negative: when the dot cluster has a slope from the upper left to the lower right.
    + Little or No Relationship: when the dot cluster lacks any slope.

#### **Strong or Weak Relationship?**
- The more closely the dot cluster resembles a straight line, the stronger (the more regular) the relationship will be.

#### **Perfect Relationship:**
- When the dot cluster falls exactly on a straight line, the relationship is perfect, which is rather unlikely in practice.

#### **Curvilinear Relationship:**
- Definition:
    + Linear Relationship: when the dot cluster approximates a straight line, the relationship is considered linear.
    + Curvilinear Relationship: when the dot cluster approximates a curve, the relationship is considered curvilinear.
- Unlike linear relationships, curvilinear ones can combine positive and negative trends in the same scatterplot. The strength of the relationship is still judged by how closely the dot cluster resembles the curve.

#### **Progress Check 6.2: Critical reading and math scores on the SAT test for students A, B, C, D, E, F, G, and H are shown in the following scatterplot**
![image.png](attachment:7a92d6b3-6125-42dd-9a12-a817960d339b.png)

(a) Which student(s) scored about the same on both tests? <br>
I, D, F. <br>
(b) Which student(s) scored higher on the critical reading test than on the math test? <br>
F, B, H, E. <br>
(c) Which student(s) will be eligible for an honors program that requires minimum scores of 700 in critical reading and 500 in math? <br>
H, E. <br>
(d) Is there a negative relationship between the critical reading and math scores? <br>
No. The relationship approximates a positive one.

### **6.3/ A CORRELATION COEFFICIENT FORMULA FOR QUANTITATIVE DATA: <i>r</i>**

- Definition:
    + Correlation Coefficient: a number between -1 and 1 that describes the relationship between pairs of variables.

#### **Key Properties of <i>r</i>:**
- Pearson Correlation Coefficient (<i>r</i>): a number between -1.00 and 1.00 that describes the linear relationship between pairs of quantitative variables.
- The key properties of <i>r</i>:
    + The sign of r indicates the type of linear relationship, either positive or negative.
    + The numerical value of r, regardless of its sign, indicates the strength of the linear relationship.
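
Because <i>r</i> describes only the linear relationship, a perfectly regular curvilinear pattern can still produce an <i>r</i> near zero. The following is a minimal sketch of that point, assuming NumPy's `np.corrcoef` as a stand-in for the notebook's `utils` helpers; the data are hypothetical.

```python
import numpy as np

# Hypothetical data: X is symmetric around 0 and Y is an exact function of X
x = np.arange(-5, 6)
y_linear = 2 * x + 1   # perfect positive linear relationship
y_curve = x ** 2       # perfect curvilinear (U-shaped) relationship

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson r
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print("linear:", r_linear)
print("curvilinear:", r_curve)
```

The straight-line data give an <i>r</i> of 1, while the U-shaped data give an <i>r</i> of essentially 0 even though Y is completely determined by X.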

#### **Sign of <i>r</i>:**

#### **Numerical Value of <i>r</i>:**
- The more the absolute value of r approaches 1, the stronger the relationship is. On the other hand, the more the absolute value of r approaches 0, the weaker the relationship is.
- From a slightly different perspective, the value of r is a measure of how well the straight line (in the context of linear relationship) describes the dot cluster in a scatterplot.
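
The link between the tightness of the dot cluster and the size of <i>r</i> can be illustrated with simulated data. This is only a sketch: the seed, sample size, and noise levels are arbitrary assumptions, and `r_with_noise` is a hypothetical helper, not part of `utils`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary fixed seed for repeatability
x = rng.normal(size=500)

def r_with_noise(noise_sd, slope=1.0):
    """Pearson r between x and (slope * x + normal noise with the given SD)."""
    y = slope * x + rng.normal(scale=noise_sd, size=x.size)
    return np.corrcoef(x, y)[0, 1]

r_tight = r_with_noise(0.2)                  # dots hug a rising line
r_loose = r_with_noise(2.0)                  # dots scatter widely around it
r_negative = r_with_noise(0.2, slope=-1.0)   # dots hug a falling line

print(r_tight, r_loose, r_negative)
```

The tight cluster yields an <i>r</i> close to 1, the loose cluster a clearly smaller positive <i>r</i>, and the falling line an <i>r</i> close to -1: the sign tracks the direction and the absolute value tracks the strength.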

#### **Interpretation of <i>r</i>:**
- As a tool for generalization, <i>r</i> cannot be interpreted at face value.
- Beyond the value of <i>r</i> itself, the number of pairs of scores used to calculate it should also be taken into account.
- As a rough guide, an <i>r</i> of:
    + 0 <= abs(r) < 0.1: indicates little or no correlation.
    + 0.1 <= abs(r) < 0.3: indicates a weak correlation.
    + 0.3 <= abs(r) < 0.5: indicates a moderate correlation.
    + 0.5 <= abs(r) < 1: indicates a strong correlation.
- When <i>r</i> is used to measure test reliability, a value in the neighborhood of 0.80 is expected.
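
The cut-offs listed above can be written as a small lookup function. This is a sketch only: `describe_strength` is a hypothetical helper, the labels are rough conventions rather than strict rules, and treating |r| = 1 as "perfect" is an assumption borrowed from the scatterplot section.

```python
def describe_strength(r):
    """Map the absolute value of r to the rough verbal labels listed above."""
    size = abs(r)
    if size > 1:
        raise ValueError("r must lie between -1 and 1")
    if size < 0.1:
        return "little or no correlation"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    if size < 1:
        return "strong"
    return "perfect"

for r in (-0.84, -0.35, 0.03, 0.56):
    print(r, "->", describe_strength(r))
```

The sign is ignored on purpose, since only the absolute value of <i>r</i> reflects strength.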

#### **<i>r</i> Is Independent of Units of Measurement:**
- Regardless of the units of measurement, <i>r</i> depends only on the pattern among pairs of scores:
    + A positive value of <i>r</i> reflects the tendency for pairs of scores to occupy similar relative locations in their respective distributions.
    + A negative value of <i>r</i> reflects the tendency for pairs of scores to occupy dissimilar (opposite) relative locations in their respective distributions.
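
A quick numerical check of this independence, sketched with hypothetical food-intake and body-weight values and approximate conversion factors (none of these numbers come from the text):

```python
import numpy as np

# Hypothetical daily food intake (ounces) and body weight (pounds) for six adults
intake_oz = np.array([60.0, 72.0, 85.0, 90.0, 104.0, 120.0])
weight_lb = np.array([130.0, 155.0, 150.0, 180.0, 175.0, 210.0])

# Convert to grams and kilograms (approximate factors)
intake_g = intake_oz * 28.35
weight_kg = weight_lb * 0.4536

r_original = np.corrcoef(intake_oz, weight_lb)[0, 1]
r_converted = np.corrcoef(intake_g, weight_kg)[0, 1]

print(r_original, r_converted)
```

Multiplying every score by a positive constant stretches the axes of the scatterplot but leaves each score's relative location unchanged, so the two values agree up to floating-point rounding.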

#### **Range Restrictions:**
- Except for special circumstances, the value of the correlation coefficient declines whenever the range of possible X and Y scores is restricted. Range restriction is analogous to magnifying a subset of the original dot cluster and, in the process, losing much of the orderly and predictable pattern in the original dot cluster.
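
The effect of range restriction can be seen in a simulation. This is a sketch under assumed settings: a bivariate normal population with a true correlation of 0.7, an arbitrary seed, and a restriction that keeps only X scores above 1.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary fixed seed

# Simulated pairs of scores with a population correlation of 0.7 (assumed)
cov = [[1.0, 0.7], [0.7, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=5000).T

r_full = np.corrcoef(x, y)[0, 1]

# Restrict the range: keep only the pairs whose X score exceeds 1
keep = x > 1.0
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(r_full, r_restricted)
```

The restricted subset shows a noticeably smaller <i>r</i> than the full set, even though both come from the same underlying relationship.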

#### **Caution:**
- The <i>r</i> cannot be interpreted as a proportion or percentage of some perfect relationship.

#### **Verbal Descriptions:**
- If possible, translate the value of <i>r</i> into a verbal phrase that stresses the pattern of the relationship between variables instead of focusing on the face value.

#### **Progress Check 6.3:** Supply a verbal description for each of the following correlations. (If necessary, visualize a rough scatterplot for r, using the scatterplots in Figure 6.3 as a frame of reference.)

(a) an r of –.84 between total mileage and automobile resale value. <br>
The resale value of a car tends to drop substantially as its mileage accumulates. <br>
(b) an r of –.35 between the number of days absent from school and performance on a math achievement test. <br>
The performance on a math test tends to be lower as the number of days absent from school is higher. <br>
(c) an r of .03 between anxiety level and college GPA. <br>
There is little or no relationship between anxiety level and college GPA. <br>
(d) an r of .56 between age of schoolchildren and reading comprehension. <br>
Schoolchildren of higher ages tend to have higher reading comprehension.

#### **Correlation Not Necessarily Cause-Effect:**
- A correlation coefficient, regardless of size, never provides information about whether an observed relationship reflects a simple cause-effect relationship or some more complex state of affairs.

#### **Role of Experimentation:**
- When there is doubt about whether other variables are involved in an observed correlation, turn to experimentation. With the independent variable manipulated and other variables controlled, the relationship can be interpreted more reliably.

#### **Progress Check 6.4:** Speculate on whether the following correlations reflect simple cause-effect relationships or more complex states of affairs. (Hint: A cause-effect relationship implies that, if all else remains the same, any change in the causal variable should always produce a predictable change in the other variable.)

(a) caloric intake and body weight. <br>
Simple cause-effect. <br>
(b) height and weight. <br>
More complex state of affairs. <br>
(c) SAT math score and score on a calculus test. <br>
Complex. <br>
(d) poverty and crime. <br>
More complex state of affairs.

### **6.4/ DETAILS: COMPUTATION FORMULA FOR <i>r</i>**

<center><b>CORRELATION COEFFICIENT (COMPUTATION FORMULA)</b></center>
<center>$\Large \it r = \frac{SP_{xy}}{\sqrt{{SS_x}{SS_y}}}$</center>
<br><br>
<center><b>SUM OF PRODUCTS (DEFINITION AND COMPUTATION FORMULAS)</b></center>
<center>$\Large SP_{xy} = \sum{(X - \overline{X})(Y - \overline{Y})} = \sum{XY} - \frac{(\sum{X})(\sum{Y})}{n}$</center>
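
The computation formulas above translate directly into NumPy. The sketch below is an illustration, not the `utils` implementation: `pearson_r` is a hypothetical helper, the data are made up, and the sums of squares use the standard computation formula $SS_x = \sum{X^2} - \frac{(\sum{X})^2}{n}$.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from the computation formulas for SP_xy, SS_x, and SS_y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sp_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # sum of products
    ss_x = np.sum(x ** 2) - np.sum(x) ** 2 / n          # sum of squares for X
    ss_y = np.sum(y ** 2) - np.sum(y) ** 2 / n          # sum of squares for Y
    return sp_xy / np.sqrt(ss_x * ss_y)

# Hypothetical pairs of scores
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = pearson_r(x, y)
print(round(r, 2))  # 0.8, since SP_xy = 8 and SS_x = SS_y = 10
```

Here $\sum{XY} = 53$ and $\sum{X} = \sum{Y} = 15$, so $SP_{xy} = 53 - \frac{(15)(15)}{5} = 8$, and $r = \frac{8}{\sqrt{(10)(10)}} = .80$.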

#### **Progress Check 6.5:** Couples who attend a clinic for first pregnancies are asked to estimate (independently of each other) the ideal number of children. Given that X and Y represent the estimates of females and males, respectively, the results are as follows:
![image.png](attachment:3b263ac1-0da9-4104-baed-6551abe966b7.png) <br>
Calculate a value for r, using the computation formula.

In [None]:
data_str = ["A 1 2",
            "B 3 4",
            "C 2 3",
            "D 3 2",
            "E 1 0",
            "F 2 3"]
classes, columns = SH.to_data(data_str, ncol=2, name_maxlength=1)
corr_coef = CC.calc_correlation(columns[:, 0], columns[:, 1])
corr_coef

### **6.5/ OUTLIERS AGAIN**

#### **Greeting Card Study Revisited:**
- Sometimes an outlier in a scatterplot is not extreme on either variable when each score is considered on its own distribution; rather, it is the combination of the two scores that makes the point an outlier.
- Outliers can cause the correlation coefficient to deviate greatly from the value that would be obtained without them, and may therefore distort the interpretation of the relationship among most of the data in the dot cluster.

#### **Dealing with Outliers:**
- The most defensible strategy is to report the correlation coefficient both with and without any outliers.
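
A small illustration of reporting <i>r</i> both ways, using hypothetical data in which the added point (10, 1) lies within the observed range of both X and Y but forms an unusual combination:

```python
import numpy as np

# Hypothetical pairs of scores with a strong positive relationship
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9], dtype=float)

# One extra pair: neither score is extreme alone, but the pairing is unusual
x_all = np.append(x, 10.0)
y_all = np.append(y, 1.0)

r_without = np.corrcoef(x, y)[0, 1]
r_with = np.corrcoef(x_all, y_all)[0, 1]

print("without outlier:", round(r_without, 2))
print("with outlier:", round(r_with, 2))
```

A single unusual pair is enough to pull <i>r</i> from roughly .94 down to roughly .59, which is why both values are worth reporting.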

### **6.6/ OTHER TYPES OF CORRELATION COEFFICIENTS**

#### **Spearman Rank-Order:**

#### **Point-biserial Correlation Coefficient <i>$r_{pbi}$</i>:**

##### **Definition:**
+ <i>$r_{pbi}$</i>: a statistical measure of the degree of relationship between a naturally occurring dichotomous nominal scale (one that splits into exactly two categories) and an interval (or ratio) scale.

##### **Usage:**
+ To determine whether two groups differ in some investigation (e.g., whether one group performs better than the other on some task).
+ In practice, the coefficient is commonly calculated for groups whose sizes exceed two dozen.

##### **Formula:**
Let each group be coded as 1 and 0, respectively (these codes only serve as nominal values, or identifications to distinguish one group from another) <br>
<center><b>POINT-BISERIAL CORRELATION COEFFICIENT FOR POPULATION</b></center>
<center>$\Large \it r_{pbi} = \frac{\mu_p - \mu_q}{\sigma}\sqrt{pq}$</center>
<br>
   Where: <br>
       + $\mu_p$: mean of scores of the group coded as 1 <br>
       + $\mu_q$: mean of scores of the group coded as 0 <br>
       + $\sigma$: population standard deviation <br>
       + p: proportion of the group coded 1 in the population <br>
       + q = 1 - p: proportion of the group coded as 0 in the population.
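
As a sketch of how the formula behaves, the hypothetical helper below (not the `CC.pbi_coefficient` function from `utils`, whose implementation is not shown here) computes $r_{pbi}$ from made-up scores and 0/1 codes, then compares it with the Pearson <i>r</i> between the codes and the scores. With $\sigma$ taken as the population standard deviation of all scores, the two values coincide.

```python
import numpy as np

def point_biserial(scores, codes):
    """r_pbi = (mean_1 - mean_0) / sigma * sqrt(p * q), with 0/1 group codes."""
    scores = np.asarray(scores, dtype=float)
    codes = np.asarray(codes)
    group1 = scores[codes == 1]
    group0 = scores[codes == 0]
    p = group1.size / scores.size   # proportion of the group coded 1
    q = 1.0 - p                     # proportion of the group coded 0
    sigma = scores.std()            # population SD of all scores (ddof=0)
    return (group1.mean() - group0.mean()) / sigma * np.sqrt(p * q)

# Hypothetical scores and group codes
scores = np.array([3.1, 2.4, 3.8, 2.9, 3.5, 2.2, 3.0, 3.6])
codes = np.array([1, 0, 1, 0, 1, 0, 0, 1])

r_pbi = point_biserial(scores, codes)
r_pearson = np.corrcoef(codes, scores)[0, 1]

print(r_pbi, r_pearson)
```

The positive value reflects that the group coded 1 has the higher mean in these made-up data.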

##### **Interpretation:**
+ Like the Pearson <i>r</i>, the value $r_{pbi}$ ranges from -1.00 to 1.00, indicating the direction and the strength of relationship.
+ However, rather than merely stating that a relationship exists between group membership and the scores, the sign of $r_{pbi}$ indicates which group tends to score higher relative to the other.
+ For example, suppose math test scores are recorded for two groups of students, male (coded as 1) and female (coded as 0). An $r_{pbi}$ of 0.5 indicates that, in these data, the male group tends to score higher on the test than the female group.

##### **Example:**

In [None]:
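# Simulate 100 scores drawn uniformly from [2.00, 4.00)
scores = np.random.uniform(2.00, 4.00, size=100)

# Randomly pick the indices of the "male" group. The upper bound of randint is
# exclusive, so 100 lets every index be chosen; np.unique drops duplicates, so
# the group usually ends up with fewer than 50 members.
male_idx = np.unique(np.random.randint(0, 100, size=50))

# Every remaining index belongs to the "female" group
female_idx = np.setdiff1d(np.arange(100), male_idx)

males = scores[male_idx]
females = scores[female_idx]

r_pbi = CC.pbi_coefficient(males, females, scores)
r_pbi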

#### **Biserial Correlation Coefficient <i>$r_{bi}$</i>:**

#### **Phi:**

#### **Tetrachoric:**

#### **Gamma:**

#### **Summary:**
- Summary table of correlation coefficient types and their use cases: <br>
![image.png](attachment:5870fed5-0cc7-44a1-89a5-b2227a01a180.png)

### **6.7/ COMPUTER OUTPUT**

#### **Reading a Larger Correlation Matrix:**

#### **Interpreting a Larger Correlation Matrix:**

#### **Progress Check 6.6:** Refer to Table 6.5 when answering the following questions.

(a) Would the same positive correlation of .2981 have been obtained between GENDER and HIGH SCHOOL GPA if the assignment of codes had been reversed, with females being coded as 1 and males coded as 2? Explain your answer. <br>
No. The correlation coefficient would then be -0.2981: reversing the code assignment swaps the roles of the two groups in the point-biserial formula, which reverses the sign of the result. <br>
(b) Given the new coding of females as 1 and males as 2, would the results still permit you to conclude that females tend to have higher high school GPAs than do males? <br>
Yes. With r equal to -0.2981, the group with the higher code (males) tends to have lower GPAs than the group with the lower code (females), which is another way of saying that female students tend to have higher GPAs than male students. <br>
(c) Would the original positive correlation of .2981 have been obtained if, instead of the original coding of males as 1 and females as 2, males were coded as 10 and females as 20? Explain your answer. <br>
Yes. The codes serve only as nominal labels that distinguish the groups; as long as their order is preserved, their numerical values do not affect the correlation. <br>
(d) Assume that the correlation matrix includes a fifth variable. What would be the total number of relevant correlations in the expanded matrix? <br>
10. The fifth variable would add 4 new correlations to the original six.
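
The answers to (a), (c), and (d) can be checked numerically. The sketch below uses hypothetical GPA values (not the data behind Table 6.5) and a hypothetical helper, `n_correlations`, that counts the distinct pairs of variables in a correlation matrix.

```python
import numpy as np
from itertools import combinations

# Hypothetical GPAs; gender coded 1 = male, 2 = female
gpa = np.array([2.8, 3.4, 3.0, 3.7, 2.5, 3.9, 3.1, 3.3])
gender = np.array([1, 2, 1, 2, 1, 2, 2, 1])

r_original = np.corrcoef(gender, gpa)[0, 1]
r_reversed = np.corrcoef(3 - gender, gpa)[0, 1]    # swaps the codes 1 and 2
r_rescaled = np.corrcoef(10 * gender, gpa)[0, 1]   # codes become 10 and 20

def n_correlations(k):
    """Number of distinct variable pairs among k variables: k * (k - 1) / 2."""
    return len(list(combinations(range(k), 2)))

print(r_original, r_reversed, r_rescaled)
print(n_correlations(4), n_correlations(5))  # 6 10
```

Reversing the codes flips the sign of <i>r</i> without changing its size, while rescaling the codes leaves <i>r</i> unchanged.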

### **REVIEW QUESTIONS**

#### **6.7.**

(a) Estimate whether the following pairs of scores for X and Y reflect a positive relationship, a negative relationship, or no relationship. Hint: Note any tendency for pairs of X and Y scores to occupy similar or dissimilar relative locations. <br>
![image.png](attachment:df287f5c-6d73-4bd9-a644-31a49142f448.png) <br>
Negative. <br>
(b) Construct a scatterplot for X and Y. Verify that the scatterplot does not describe a pronounced curvilinear trend. <br>
(c) Calculate r using the computation formula (6.1).

In [2]:
# Solution:
data_str = ["64 66", "40 79", "30 98",
            "71 65", "55 76", "31 83",
            "61 68", "42 80", "57 72"]
ignored, data = SH.to_data(data_str, 2, 0)

In [None]:
# b):
plot = CC.draw_scatterplot(data[:, 0], data[:, 1])

In [None]:
# c):
r = CC.pearson_coefficient(data[:, 0], data[:, 1])
r

#### **6.8.** Calculate the value of r using the computational formula (6.1) for the following data.
![image.png](attachment:1a88e4e8-67d6-458f-a322-b4d5280ab130.png)

In [6]:
# Solution:
data_str = ["2 8","4 6","5 2",
            "3 3","1 4","7 1","2 4"]
ignored, data = SH.to_data(data_str, 2, 0)
r = CC.pearson_coefficient(data[:, 0], data[:, 1])
r

-0.61

#### **6.9.** Indicate whether the following generalizations suggest a positive or negative relationship. Also speculate about whether or not these generalizations reflect simple cause-effect relationships.

(a) Preschool children who delay gratification (postponing eating one marshmallow to win two) subsequently receive higher teacher evaluations of adolescent competencies. <br>
Positive. Complex state of affairs. <br>
(b) College students who take longer to finish a test perform more poorly on that test. <br>
Negative. Complex state of affairs. <br>
(c) Heavy smokers have shorter life expectancies. <br>
Negative. Cause-effect. <br>
(d) Infants who experience longer durations of breastfeeding score higher on IQ tests in later childhood. <br>
Positive. Cause-effect.

#### **6.10.** On the basis of an extensive survey, the California Department of Education reported an r of –.32 for the relationship between the amount of time spent watching TV and the achievement test scores of schoolchildren. Each of the following statements represents a possible interpretation of this finding. Indicate whether each is True or False.

(a) Every child who watches a lot of TV will perform poorly on the achievement tests. <br>
False. This statement would be true only if a perfect negative relationship (-1.00) described the relationship between TV viewing time and test scores.<br>
(b) Extensive TV viewing causes a decline in test scores. <br>
False. Correlation does not necessarily signify cause-effect. <br>
(c) Children who watch little TV will tend to perform well on the tests. <br>
True. <br>
(d) Children who perform well on the tests will tend to watch little TV. <br>
True. <br>
(e) If Gretchen’s TV-viewing time is reduced by one-half, we can expect a substantial improvement in her test scores. <br>
False, for the same reason as (b): correlation does not necessarily signify cause-effect.<br>
(f) TV viewing could not possibly cause a decline in test scores. <br>
False. Although correlation does not necessarily signify cause-effect, it leaves open the possibility of one.

#### **6.11.** Assume that an r of .80 describes the relationship between daily food intake, measured in ounces, and body weight, measured in pounds, for a group of adults. Would a shift in the units of measurement from ounces to grams and from pounds to kilograms change the value of r ? Justify your answer.

No. The value of r reflects the relative standings of pairs of scores within their respective distributions. A change in the units of measurement does not change the standing of any score relative to its mean, and therefore does not change the value of r.

#### **6.12.** An extensive correlation study indicates that a longer life is experienced by people who follow the seven “golden rules” of behavior, including moderate drinking, no smoking, regular meals, some exercise, and eight hours of sleep each night. Can we conclude, therefore, that this type of behavior causes a longer life?

It would not be appropriate to conclude a cause-effect relationship from this correlational study alone. We can conclude only that people who follow these healthy habits tend to live longer.