# Bivariate Data Analysis

Bivariate data involves two variables; we previously learned about univariate (one-variable) data.

We commonly use **Scatterplots** to display bivariate data and see if there is any correlation between the two variables. <br /> We usually place the ***Explanatory Variable*** on the x-axis to see whether it has any effect on the variable on the y-axis

![](Images/scatter-plot.png)

For example, the graph above shows that these two variables have a **Positively Associated** relationship: as the independent variable increases, the dependent variable increases as well

## Correlation

**Linear Relationship:** Two variables are linearly related to the extent that their relationship can be modeled by a line. <br /> **Correlation Coefficient:** The primary statistic we have for measuring the strength of a linear relationship (or lack thereof)

![](Images/corr_coef.png)

### The Correlation Coefficient (r) must satisfy -1 <= r <= 1 <br /> If r = -1 or r = 1, all the points lie exactly on a line

![](Images/corr_strength.png)

If r > 0 there is a positive relationship <br />
If r < 0 there is a negative relationship <br />
If r = 0 there is no linear relationship <br />
r is not resistant to extreme values because it is based on the mean. A single extreme value can change it substantially
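To see how sensitive r is to one extreme value, here is a small sketch. The data and the helper name `pearson_r` are made up for illustration (they are not from the book), and the helper uses the sums-of-products form of r rather than z-scores.

```python
def pearson_r(pairs):
    """Pearson correlation from (x, y) pairs, with no intermediate rounding."""
    n = len(pairs)
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    s_xx = sum((x - x_mean) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_mean) ** 2 for _, y in pairs)
    return s_xy / (s_xx * s_yy) ** 0.5

# Hypothetical data: five points exactly on the line y = x
on_a_line = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
# The same points plus one extreme value far below the line
with_outlier = on_a_line + [(6, -20)]

r_clean = pearson_r(on_a_line)
r_outlier = pearson_r(with_outlier)
print(r_clean, r_outlier)
```

One added point is enough to turn a perfect positive correlation into a negative one, which is why it helps to look at the scatterplot before trusting r.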

# Let's Try and make a function to calculate r

In [3]:
def mean(a_list):
    return sum(a_list) / len(a_list)

In [4]:
def standard_dev(a_list):
    mean = sum(a_list) / len(a_list)
    x_sum = 0
    
    for num in a_list:
        x_sum += (num - mean)**2
        
    return (x_sum* 1/(len(a_list)-1))**.5

In [5]:
def z_score(num, sdv, mean):
    
    return round((num - mean) / sdv,2)

In [36]:
# Relationship b/w students' hours studied (x) and test scores (y), stored as (x, y) tuples
study_test_relationship = [(.5,65), (2.5,80), (3,77), (1.5,60), (1.25,68), (.75,70), (4,83), (2.25,85), (1.5,70), (6,96), (3.25, 84), (2.5,84), (0,51), (1.75,63), (2,71)]

In [15]:
print(study_test_relationship)

[(0.5, 65), (2.5, 80), (3, 77), (1.5, 60), (1.25, 68), (0.75, 70), (4, 83), (2.25, 85), (1.5, 70), (6, 96), (3.25, 84), (2.5, 84), (0, 51), (1.75, 63), (2, 71)]


In [6]:
def correlation_coefficient(data_set):
    
    x_list = []
    y_list = []
    z_sum = 0
    n = len(data_set)
    
    for x,y in data_set:
        x_list.append(x)
        y_list.append(y)
        
    x_mean = mean(x_list)
    y_mean = mean(y_list)
    
    x_sdv = standard_dev(x_list)
    y_sdv = standard_dev(y_list)
    
        
    for x_num,y_num in data_set:
        z_sum += (z_score(x_num, x_sdv, x_mean) * z_score(y_num, y_sdv,y_mean))

    return ((z_sum) / (n-1))

In [65]:
correlation_coefficient(study_test_relationship)

0.8629785714285713
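The `z_score` helper above rounds each z-score to two decimals, so the r it produces is slightly approximate. As a cross-check, this sketch computes r from unrounded z-scores using Python's standard `statistics` module. The function name `r_from_z_scores` is an assumption for illustration, not part of the notebook's own module.

```python
import statistics

def r_from_z_scores(pairs):
    """r = sum(z_x * z_y) / (n - 1), keeping full precision in every z-score."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (n - 1)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in pairs]
    return sum(z_products) / (len(pairs) - 1)

study_test_relationship = [(.5, 65), (2.5, 80), (3, 77), (1.5, 60), (1.25, 68),
                           (.75, 70), (4, 83), (2.25, 85), (1.5, 70), (6, 96),
                           (3.25, 84), (2.5, 84), (0, 51), (1.75, 63), (2, 71)]

r_unrounded = r_from_z_scores(study_test_relationship)
print(round(r_unrounded, 4))
```

The result should land close to the 0.863 shown above; any small gap comes from the rounding inside `z_score`.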

# Least-Squares Regression Line

**Regression Line:** A Line that can be used for predicting response values from explanatory variables

![](Images/least_squares_regression.jpg)

![](Images/least-square-regression2.png)

In [1]:
# Let's do this for the study example
study_test_relationship = [(.5,65), (2.5,80), (3,77), (1.5,60), (1.25,68), (.75,70), (4,83), (2.25,85), (1.5,70), (6,96), (3.25, 84), (2.5,84), (0,51), (1.75,63), (2,71)]

In [2]:
x_list = []
y_list = []

for x,y in study_test_relationship:
    x_list.append(x)
    y_list.append(y)

In [3]:
print(x_list)

[0.5, 2.5, 3, 1.5, 1.25, 0.75, 4, 2.25, 1.5, 6, 3.25, 2.5, 0, 1.75, 2]


In [4]:
print(y_list)

[65, 80, 77, 60, 68, 70, 83, 85, 70, 96, 84, 84, 51, 63, 71]


### Numbers crunched in stats module

We want to predict what score someone would get if they studied for 2.75 hours

In [18]:
b = round(.863 * (11.75/ 1.5),2) # b= r * (sdv_y / sdv_x)

In [19]:
b

6.76

In [25]:
a = 73.8 - (b*2.18) # a = mean_y - b * mean_x

In [26]:
a

59.063199999999995

In [27]:
y_hat = a+ (b*2.75) # predicted y value = a + b * trial x value (2.75)

In [28]:
y_hat

77.6532

The regression equation can be thought of as: score = 59 + 6.76 * hours

### a is the y-intercept and b is the slope of the line

In our example, a = 59 which is the predicted score when hours studied (x) is zero.

In [66]:
regression_line(study_test_relationship)

 ŷ = 59.05+6.76x


# Regression Line function

In [7]:
def regression_line(a_list,give_values=False):
    x_list = []
    y_list = []

    for x,y in a_list:
        x_list.append(x)
        y_list.append(y)
        
    sdv_y = standard_dev(y_list)
    sdv_x = standard_dev(x_list)
    
    y_mean = mean(y_list)
    x_mean = mean(x_list)
    r = correlation_coefficient(a_list)
        
    b = r * (sdv_y / sdv_x) # b= r * (sdv_y / sdv_x)
    a = y_mean - (b*x_mean) # a = mean_y - b * mean_x
    
    if give_values == False:
        print(f" ŷ = {round(a,2)}+{round(b,2)}x")
    
    if give_values == True:
        return a, b
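Algebraically, b = r * (sdv_y / sdv_x) simplifies to the sum of cross-products divided by the sum of squared x deviations, so the line can also be found without computing r first. The sketch below uses that form; `least_squares_line` and the sample points are assumptions for illustration.

```python
def least_squares_line(pairs):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical points that lie exactly on y = 1 + 2x
a, b = least_squares_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(a, b)
```

Because nothing is rounded along the way, this version is a handy check on the coefficients printed by `regression_line`.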

# Predict Y Value of data set

In [8]:
def predict_y(a_list, x_predict):
    
    a,b = regression_line(a_list, True)
        
    y_predict = a + (b * x_predict)
    
    return y_predict

In [11]:
# New Book Example
criminal = [(300,200), (880,380), (1000,400), (1540,200), (1560,800), (1600,600), (1600,800), (2200,1000), (3200,1600), (6000,2700)]

# Predict X̂ 

In [9]:
def predict_x(a_list, y_value):
    
    a,b = regression_line(a_list, True)
        
    x_predict = (y_value - a) / b
    
    return round(x_predict,3)

In [13]:
predict_x(height_age,156.2)

143.599

### Question A

In [97]:
correlation_coefficient(criminal)

0.9727333333333333

In [98]:
regression_line(criminal)

 ŷ = -56.83+0.47x


### Question C

The slope of about .47 says that for every 1-unit increase in salary we would predict a .47 increase in restitution payment

### Question D

In [82]:
predict_y(criminal, 1400)

594.459

# Residuals

The difference between the actual y and the predicted ŷ (y - ŷ) is formally called the **Residual** <br />
It shows the amount of error in our prediction model (the linear regression line) for that data point

We have the actual y for a criminal with a salary of 1560 a month, which is 800 in restitution. <br />
Let's see what our model would predict as ŷ

In [83]:
predict_y(criminal, 1560)

668.892

So our Residual for this data point would be 800-669 or 131

### Residual plots that show a clear pattern are an indication that the data isn't linear

![](Images/non_linear_residuals.PNG)

# Residual

In [10]:
def residual(a_list,x,y):

    y_hat = predict_y(a_list, x)

    return y-y_hat

In [177]:
round(residual(criminal,1560,800),3)

131.108

In [2]:
# Book Example
height_age = [(18,76), (19,77.1), (20,78.1), (21,78.3), (22,78.8), (23, 79.4), (24,79.9), (25,81.3), (26,81.1), (27,82), (28,82.6), (29,83.5)]

# Average Residual

In [11]:
def average_residual(a_list):
    
    n = len(a_list)
    total_yhat = 0
    total_y = 0

    for x_num, y_num in a_list:
        total_yhat += predict_y(a_list, x_num)
        total_y += y_num
        
    total_residual = total_y - total_yhat
    avg_residual = total_residual / n
    
    return round(avg_residual,3)    
        
    

In [16]:
average_residual(height_age)

-0.0

# Total Residual (Sum of Residuals)

In [12]:
def total_residual(a_list):
    n = len(a_list)
    total_yhat = 0
    total_y = 0

    for x_num, y_num in a_list:
        total_yhat += predict_y(a_list, x_num)
        total_y += y_num
        
    total_residual = total_y - total_yhat
    
    return total_residual

In [18]:
total_residual(height_age)

-0.0009999999998626663

### Book Question

**a** Does the regression line seem to be a good model for the data?

In [183]:
# Correlation Coefficient is VERY high
round(correlation_coefficient(height_age),3)

0.995

In [184]:
# Our Regression Line Equation
regression_line(height_age,give_values=False)

 ŷ = 64.9+0.64x


In [185]:
# Our Average Residual
average_residual(height_age)

-0.0

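Our regression line has a VERY high correlation coefficient (r is about 0.995), so yes, this line fits the data extraordinarily well. Note that an average residual of 0 is expected for any least-squares line, so it is not evidence of a good fit by itself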

**b** What is the value of the residual for a child at 19 months?

In [186]:
round(residual(height_age,19,77.1),3)

0.119

## Interpolation

If we are trying to predict a value of y from an x value ***within*** the range of observed x values, that is interpolation. <br />
If a line has been shown to be a good model for the data and fits it well (a strong r and small residuals with no clear pattern), we can have confidence in interpolated predictions

In [187]:
predict_y(height_age,19)

76.981

## Extrapolation

Extrapolation is predicting y from an x value ***outside*** the range of observed x values <br />
We can ***Rarely*** have confidence in extrapolated values, see example below where at 144 months we'd predict the child to be 13 feet tall!

In [188]:
predict_y(height_age,144)

156.455
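A quick guard against accidental extrapolation is to compare the requested x with the smallest and largest x in the data. This is only a sketch; the helper name `classify_prediction` is an assumption and not part of the notebook's module.

```python
def classify_prediction(pairs, x_new):
    """Label x_new as interpolation or extrapolation relative to the observed x range."""
    xs = [x for x, _ in pairs]
    return "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"

height_age = [(18, 76), (19, 77.1), (20, 78.1), (21, 78.3), (22, 78.8), (23, 79.4),
              (24, 79.9), (25, 81.3), (26, 81.1), (27, 82), (28, 82.6), (29, 83.5)]

print(classify_prediction(height_age, 19))   # interpolation
print(classify_prediction(height_age, 144))  # extrapolation
```

A check like this does not make an extrapolated value any more trustworthy; it only flags that the prediction lies outside the ages the line was fitted on.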

# Coefficient of Determination

If we didn't have a regression line and wanted to predict y, our only decent option would be to use the mean of y, ȳ, as our estimate for any given x <br />
If we guessed that the height for any given age in height_age was ȳ, we'd probably be a bit off from reality... <br />
So we calculate the sum of the squares of (y - ȳ), which is called the **Sum of Squares Total** <br />
It represents the total variability of y <br />
Now suppose we made a regression line. The sum of the squares of the residuals (y - ŷ) is called the **Sum of Squared Errors** <br />
So ***SST*** represents the error from using the mean of y, ȳ, and ***SSE*** represents the error from using the regression line <br />
The proportion of the total variability in y that is explained by the regression of y on x, (SST - SSE) / SST, is called the **Coefficient of Determination**

# Sum of Squares Total

In [13]:
def sum_of_squares_total(a_list):
    y_list = []
    sum_squares_total = 0
    
    for x,y in a_list:
        y_list.append(y)
        
    # find mean of y
    y_mean = mean(y_list)
    
    for y in y_list:
        sum_squares_total += (y - y_mean)**2
        
    return sum_squares_total

In [98]:
sum_of_squares_total(height_age)

58.32916666666668

In [96]:
total_residual(height_age)

-0.0009999999998626663

# Sum of Squared Errors

In [23]:
def sum_of_squared_errors(a_list):
    return total_residual(a_list)
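Note that `total_residual` adds up the raw residuals (y - ŷ), which cancel out to roughly zero for any least-squares line, so it is not the same quantity as the sum of *squared* errors. Below is a hedged sketch of an SSE that squares each residual before summing. The names `fit_line`, `sum_of_residuals` and `sse_sketch` are assumptions for illustration, and the fit uses unrounded sums rather than the notebook's `regression_line`.

```python
def fit_line(pairs):
    """Least-squares intercept and slope from (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
         / sum((x - x_bar) ** 2 for x, _ in pairs))
    return y_bar - b * x_bar, b

def sum_of_residuals(pairs):
    a, b = fit_line(pairs)
    return sum(y - (a + b * x) for x, y in pairs)

def sse_sketch(pairs):
    a, b = fit_line(pairs)
    return sum((y - (a + b * x)) ** 2 for x, y in pairs)

bacteria = [(1, 1.8), (1.5, 2.4), (2, 3.1), (2.5, 4.3), (3, 5.8),
            (3.5, 8), (4, 10.6), (4.5, 14), (5, 18)]

print(round(sum_of_residuals(bacteria), 6))  # residuals cancel, so this is about 0
print(round(sse_sketch(bacteria), 3))        # squared residuals do not cancel
```

On the curved bacteria data the residuals still sum to about zero even though the line fits poorly, while the SSE is clearly positive.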

# Coefficient of Determination

In [14]:
def coefficient_determination(a_list):
    """
    Gives coefficient of determination via two means, first (SST-SSE)/SST then r**2
    """
    

    sum_squared_errors = total_residual(a_list)
    sum_squares_total = sum_of_squares_total(a_list)

    
    return ((sum_squares_total - sum_squared_errors) / sum_squares_total), correlation_coefficient(a_list)**2

In [95]:
coefficient_determination(height_age)

(1.0000171440817178, 0.9910202499999999)
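The two values in the tuple above should agree, and a proportion cannot exceed 1. The first value goes above 1 because it is built from the summed raw residuals rather than squared ones. The sketch below computes both forms from unrounded sums; `r_squared_two_ways` is a hypothetical helper name.

```python
def r_squared_two_ways(pairs):
    """Return (1 - SSE/SST, r**2); for a least-squares line the two should match."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sst = sum((y - y_bar) ** 2 for _, y in pairs)   # Sum of Squares Total
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)  # Sum of Squared Errors
    from_sums = 1 - sse / sst        # same as (SST - SSE) / SST
    from_r = s_xy ** 2 / (s_xx * sst)
    return from_sums, from_r

height_age = [(18, 76), (19, 77.1), (20, 78.1), (21, 78.3), (22, 78.8), (23, 79.4),
              (24, 79.9), (25, 81.3), (26, 81.1), (27, 82), (28, 82.6), (29, 83.5)]

from_sums, from_r = r_squared_two_ways(height_age)
print(round(from_sums, 4), round(from_r, 4))
```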

# Influential Observations

An Influential observation is often an outlier in the x direction. Its influence, if it doesn't line up with the rest of the data, is on the slope of the regression line

### Influential Observation

![](Images/influential_observation.png)

### Non-Influential Observation

![](Images/non_influential_obs.png)

### Example: The Number of a certain type of bacteria present (in thousands) after a certain number of hours is given below

In [60]:
bacteria = [(1,1.8), (1.5,2.4), (2,3.1), (2.5,4.3), (3,5.8), (3.5,8), (4,10.6), (4.5,14), (5,18)]

In [61]:
regression_line(bacteria)

 ŷ = -4.31+3.95x


In [62]:
predict_y(bacteria,3.75)

10.521

In [65]:
total_residual(bacteria)

0.0

In [66]:
for x,y in bacteria:
    print(residual(bacteria,x,y))

2.152
0.7749999999999999
-0.5019999999999998
-1.279
-1.7560000000000002
-1.532
-0.9090000000000007
0.5139999999999993
2.537000000000001


In [71]:
slope = (bacteria[1][1] - bacteria[0][1]) / (bacteria[1][0] - bacteria[0][0])
# (y2 - y1) / (x2 - x1)

In [72]:
slope

1.1999999999999997

# Data from our bacteria data set
[From stats.cpm](https://stats.cpm.org/scatterplot/)

![](Images/Capture1.PNG)

### Looking at these graphs we see our data doesn't seem to be linear but instead looks rather exponential, so a linear regression line doesn't make sense

In [15]:
import math

We will need to transform our data set since it isn't linear; we can do this by taking the natural log of each y value

Our new transformed plot looks much more linear now

![](Images/Capture2.PNG)

# Transform Bivariate Set

In [16]:
def transform_bivariate(a_list):
    transformation = []
    for x, y in a_list:
        transformation.append((x,math.log(y)))
        
    return transformation
        

In [82]:
transformed_bacteria = []
for x,y in bacteria:
    transformed_bacteria.append((x,math.log(y)))

In [83]:
transformed_bacteria

[(1, 0.5877866649021191),
 (1.5, 0.8754687373538999),
 (2, 1.1314021114911006),
 (2.5, 1.4586150226995167),
 (3, 1.7578579175523736),
 (3.5, 2.0794415416798357),
 (4, 2.3608540011180215),
 (4.5, 2.6390573296152584),
 (5, 2.8903717578961645)]

In [84]:
regression_line(transformed_bacteria)

 ŷ = -0.01+0.59x


In [86]:
predict_y(transformed_bacteria,3.75)

2.193

We can now try and predict y given x=3.75 with our new transformed regression line

However, 2.193 is ln(number), not the number itself, so we have to undo the log: e^2.193 ≈ **8.962**

# Transform Predicted Y

In [17]:
def transform_predicted_y(a_list,x_predict):
    """
    Takes the predicted ln(y) value and applies e^x to convert it back to the original units
    """
    
    y_predicted = predict_y(a_list,x_predict)
    
    return math.exp(y_predicted)

In [128]:
transform_predicted_y(transformed_bacteria,3.75)

8.962059002740496
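Fitting a line to (x, ln y) is the same as fitting an exponential model: if ln(ŷ) = a + b*x, then ŷ = e^a * (e^b)^x. The sketch below returns those two constants directly. The name `fit_exponential` and the synthetic test data are assumptions for illustration.

```python
import math

def fit_exponential(pairs):
    """Fit ln(y) = a + b*x by least squares; return (A, growth) with y ~ A * growth**x."""
    logged = [(x, math.log(y)) for x, y in pairs]
    n = len(logged)
    x_bar = sum(x for x, _ in logged) / n
    ly_bar = sum(ly for _, ly in logged) / n
    b = (sum((x - x_bar) * (ly - ly_bar) for x, ly in logged)
         / sum((x - x_bar) ** 2 for x, _ in logged))
    a = ly_bar - b * x_bar
    return math.exp(a), math.exp(b)

# Synthetic data that follows y = 2 * 3**x exactly
exact = [(x, 2 * 3 ** x) for x in range(5)]
A, growth = fit_exponential(exact)
print(round(A, 6), round(growth, 6))
```

Read this way, e^b is the factor by which the predicted y is multiplied for each one-unit increase in x.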

# Rapid Review

## 1
There's a strong positive correlation in this data set

## 2
No, these residuals show a clear pattern, so it is unlikely that a line will fit the data set well

## 3 
It is an outlier as it is far outside the normal data set and it is an influential observation as it is an outlier in the x direction.

## 4
For every hour studied GPA is predicted to rise by .11

## 5
r^2=.45 means that 45% of the variability in college success is explained by the regression of GPA on socioeconomic status

## 6
This is a strong correlation, but we need more data to assess whether or not it's likely that the Governor is affecting this downward trend in crime: Correlation != Causation

## Free Response

### 1

In [99]:
# b = .8(11/4) = 2.2, a = 20-2.2*14.5 = -11.9 

y = -11.9 + 2.2x

### 2

In [102]:
test_test = [(63,51), (32,21), (87,52), (73,90), (60,83), (63,54), (83,73), (80,85),(98,83), (85,46)]

In [103]:
correlation_coefficient(test_test)

0.5500111111111112

![](Images/Free_Response2.PNG)

They don't seem very similar; however, the correlation is about .55, which is fairly high

### 3

Residual plots that show a clear pattern indicate a line isn't a good fit (not sure why this is so axiomatically). <br />
A residual is actual y minus predicted ŷ, so a positive residual means actual > predicted, so the regression line underestimates the y-value

### 4

In [105]:
swimmer = [(1,77.3), (2,80.2), (3,77.1), (4,76.4), (5,75.5), (6,75.9), (7,75.1), (8, 74.3)]

In [106]:
regression_line(swimmer)

 ŷ = 79.21+-0.61x


In [109]:
predict_x(swimmer, 60)

31.647

The predict x equation is (y_value - a) / b, so (60 - 79.21) / (-.61) ≈ 31 years... <br />
This seems crazy, and it is an example of extrapolation, or predicting from a value beyond the range of the data, which tends to give wild results <br />
In this case we're assuming the swimmer will always improve and never (even over 30 years!) regress

### 5

In [110]:
roaches = [(2,3), (5,4.5), (8,6), (11,7.9), (14,11.5)]

In [111]:
regression_line(roaches)

 ŷ = 1.16+0.68x


In [112]:
predict_y(roaches,9)

7.257

We would predict approx. 7.3 roaches after 9 days

### My Answer above is incorrect; remember to check the scatterplot first!

![](Images/roaches.PNG)

When looking at this scatterplot we see the data seems exponential not linear, so we'll need to use ln()

In [120]:
transformed_roaches = transform_bivariate(roaches)

In [121]:
transformed_roaches

[(2, 1.0986122886681098),
 (5, 1.5040773967762742),
 (8, 1.791759469228055),
 (11, 2.066862759472976),
 (14, 2.4423470353692043)]

In [122]:
regression_line(transformed_roaches)

 ŷ = 0.92+0.11x


In [129]:
transform_predicted_y(transformed_roaches,9)

6.606143173596939

#### So we would predict 6.6 roaches after 9 days

### 6

Correlation != Causation, so the advice may be beneficial, but we can't know for sure from correlation alone. A more exhaustive look at these two variables is necessary. For instance, has the reporter looked into whether the variables' effects are reversed? Perhaps healthier people simply like to walk more?

### 7

I can't see the data points plotted on a scatter plot, so I don't know for sure if these correlations and coefficients of determination are relevant. <br />
Assuming the relationship is linear, then yes, that's an enormous correlation: if 72% of the variance is explained by the explanatory variable of socioeconomic status, then the original correlation was (square root of .72 =) about .85.

### 8

I said b,c,e

Answers: **b,c,d**

### 9

In [135]:
#y-y_hat
data = [(45,15), (73,7.9), (82,5.8), (91,3.5)]
predict_y(data,73)

7.987593315617893

In [136]:
regression_line(data)

 ŷ = 26.21+-0.25x


In [137]:
7.9 - 7.987

-0.08699999999999974

When I run the numbers myself, however, I get 7.961, so 7.9 - 7.961 = -.061... <br />
The book officially agrees with -.061 but in the same answer originally has 7.987, so the two values are inconsistent <br />
The difference looks like a rounding effect: plugging x = 73 into the rounded equation ŷ = 26.21 - 0.25x gives 7.96, while Python keeps the unrounded slope (about -0.2496) and gets 7.988

### 10

In [138]:
dummy_data = [(10,10), (15,8), (16,6), (17,4)]

In [139]:
regression_line(dummy_data)

 ŷ = 17.98+-0.76x


In [154]:
correlation_coefficient(dummy_data)

-0.9116

#### A

In [146]:
new_dummy = []
for x,y in dummy_data:
    new_dummy.append((x,y*-1))

In [147]:
new_dummy

[(10, -10), (15, -8), (16, -6), (17, -4)]

In [148]:
regression_line(new_dummy)

 ŷ = -17.98+0.76x


In [149]:
correlation_coefficient(new_dummy)

0.9116

The correlation is now positive with the same magnitude (.9116), and the regression line's slope and intercept have both changed sign, so the slope is now positive

#### B

In [150]:
new_dummy = []
for x,y in dummy_data:
    new_dummy.append((y,x))

In [151]:
new_dummy

[(10, 10), (8, 15), (6, 16), (4, 17)]

In [152]:
regression_line(new_dummy)

 ŷ = 22.18+-1.1x


In [153]:
correlation_coefficient(new_dummy)

-0.9116

The slope is still negative and is now steeper (-1.1 instead of -0.76), but the correlation remains the same

#### D

In [155]:
new_dummy = []
for x,y in dummy_data:
    new_dummy.append((x*-1,y*-1))

In [156]:
new_dummy

[(-10, -10), (-15, -8), (-16, -6), (-17, -4)]

In [157]:
regression_line(new_dummy)

 ŷ = -17.98+-0.76x


In [158]:
correlation_coefficient(new_dummy)

-0.9116

The slope is the same but the y-intercept has changed sign
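The three experiments in this question can be run side by side. This is a sketch under the assumption that an unrounded least-squares fit is acceptable here; `line_and_r` is a hypothetical helper, so its r differs slightly from the rounded -0.9116 printed above.

```python
def line_and_r(pairs):
    """Return (intercept, slope, r) for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / (s_xx * s_yy) ** 0.5

dummy_data = [(10, 10), (15, 8), (16, 6), (17, 4)]
variants = {
    "original": dummy_data,
    "y negated": [(x, -y) for x, y in dummy_data],
    "x and y swapped": [(y, x) for x, y in dummy_data],
    "both negated": [(-x, -y) for x, y in dummy_data],
}

for label, pairs in variants.items():
    a, b, r = line_and_r(pairs)
    print(f"{label:16} a = {a:7.2f}  b = {b:6.2f}  r = {r:6.3f}")
```

Negating one variable flips the sign of r, swapping x and y leaves r unchanged while changing the line, and negating both keeps r and the slope but flips the sign of the intercept.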

### 11

We need to find the coefficient of determination and subtract it from one to find the % variability NOT explained by the regression of training tasks on dexterity

In [1]:
# b = r * (sy/sx), 2.7 = r *3.33, r= 2.7/3.33 = .81
#.81^2 equals .66 the coefficient of determination
#1-.66 = .34

So about 34% of the variability is not explained by the regression

### 12

I would think the correlation would increase and the slope would decrease

#### A
y = -.39+.118X

#### B
Square root of .974 = .986

#### C
y = -.39 +.118*(20) = 1.97 or 1,970

### 14

#### A

In [2]:
homes = [(110,15.7), (80,10), (95,12.7), (70,7.8), (55,10.4)]

In [18]:
regression_line(homes)

 ŷ = 1.88+0.12x


#### B

In [19]:
predict_y(homes,85)

11.66521144330469

#### C

In [20]:
correlation_coefficient(homes)

0.81965

#### D

I would have to use the mean of Y in this case 11.32%

### 15

I would estimate that the slope would be near straight and positive since r = .9, which is a relatively large correlation. <br />
The book says that the slope would be r since the data is 'standardized'. That is because standardized data has sx = sy = 1, so b = r * (sy / sx) = r.
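A small numeric check of the claim that the slope equals r for standardized data. This sketch reuses the `homes` data from question 14 purely as sample input (the book's data for this question isn't shown), and `standardize` is a hypothetical helper built on Python's standard `statistics` module.

```python
import statistics

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

homes = [(110, 15.7), (80, 10), (95, 12.7), (70, 7.8), (55, 10.4)]
zx = standardize([x for x, _ in homes])
zy = standardize([y for _, y in homes])

# Least-squares slope of zy on zx (both z-score lists have mean 0)
slope = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)
# r as the sum of z-score products divided by n - 1
r = sum(u * v for u, v in zip(zx, zy)) / (len(homes) - 1)
print(round(slope, 4), round(r, 4))
```

The two numbers match because the squared z-scores of x always sum to n - 1.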

### 16

**A:** .6 <br/>
**B:** 0 <br />
**C:** -.95 <br />
**D:** -.2 (book says -.5)

### 17

It's clear there will be no residuals at all, as the regression line exactly matches the actual line

In [21]:
data = [(7,10), (8,11), (11,14), (12,15), (15,18)]

In [24]:
sum_of_squared_errors(data)

0.0

### 18

In [25]:
school_rating = [(40,10), (60,50), (70,60), (73,65), (75,75), (68,73), (65,78), (85,80), (98,90), (90,95)]

In [26]:
regression_line(school_rating)

 ŷ = -29.37+1.34x


#### A

In [27]:
correlation_coefficient(school_rating)

0.9047777777777777

In [29]:
coefficient_determination(school_rating)

(1.0, 0.8186228271604936)

Correlation != Causation, but the correlation is enormous here and a good indicator that there may be a relationship

#### B

In [28]:
predict_y(school_rating,80)

77.77876672181986

### 19

A,b,e

### 20

#### A

7.1 +.35(12) = 11.3

#### B

If someone has zero right hand strength they are predicted to have 7.1 left hand strength (perhaps only have one hand?) <br />
For every 1 kg of right hand strength someone is predicted to have .35kg left hand strength

## Cumulative Review

### 1

A statistic is gathered from a sample of a population and a parameter is taken directly from an entire population

### 2

**False** For an interval of fixed length, there will be a greater proportion of the area under the normal curve if the interval is closer to the center than if it is removed from the center

### 3

In [30]:
data = [82.93,26,56,75,73,80,61,79,90,94,93,100,71,100,60]

In [37]:
from statistics_module import median

In [38]:
mean(data)

76.062

In [39]:
median(data)

79.0

Looks like the outlier is affecting the mean, but the median is resistant to it
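One way to see the resistance is to drop the low value (26) and compare how far each statistic moves. This sketch uses Python's standard `statistics` module rather than the notebook's own `statistics_module`, and the variable names are assumptions for illustration.

```python
import statistics

data = [82.93, 26, 56, 75, 73, 80, 61, 79, 90, 94, 93, 100, 71, 100, 60]
without_low = [v for v in data if v != 26]

# How much each statistic changes when the low value is removed
mean_shift = statistics.mean(without_low) - statistics.mean(data)
median_shift = statistics.median(without_low) - statistics.median(data)
print(round(mean_shift, 2), round(median_shift, 2))
```

The mean moves by several points while the median barely changes.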

### 4

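No. Each deviation from the mean is squared before it is summed, so the sum cannot be negative, and the standard deviation is the non-negative square root of that sum divided by n - 1. Even if most of the values (and therefore the mean) are negative, the standard deviation cannot be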

### 5

In [40]:
import statistics_module

In [42]:
statistics_module.give_stats(data)

Median: 79.0
Mean: 76.06
Standard Deviation: 19.73
Q1: 61.0
Q3: 93.0
Interquartile Range: 32.0
Outliers Median Method: []
Outliers Mean Method Strong (3+SDV): []
Outliers Mean Method Weak (2SDV): [26]
Minimum Value: 26
Maximum Value: 100
