# ASSOCIATIONS: TWO QUANTITATIVE VARIABLES
## Introduction
When associations exist between variables, it means that information about the value of one variable gives us information about the value of the other variable. In this lesson, we will cover ways of examining an association between two quantitative variables.


Throughout the next few exercises, we’ll examine some data about Texas housing rentals on Craigslist — an online classifieds site. The data dictionary is as follows:


- price: monthly rental price in USD
- type: type of housing (e.g., 'apartment', 'house', 'condo', etc.)
- sqfeet: housing area, in square feet
- beds: number of beds
- baths: number of baths
- lat: latitude
- long: longitude


Except for type, all of these variables are quantitative. Which pairs of variables do you think might be associated? For example, does knowing something about price give you any information about square footage?


## Instructions
##### 1. The dataset described above has been saved for you as a pandas dataframe named housing. Use the .head() method to print the first 10 rows and inspect some more of the data. What are some other quantitative variables that might be related to each other?


`Hint` <br>
`Use the .head() method as follows:`

In [None]:
housing.head(10)

In [None]:
import pandas as pd
import codecademylib3


housing = pd.read_csv('housing_sample.csv')

# print the first 10 rows of data:
print(housing.head(10))

# Scatter Plots
One of the best ways to quickly visualize the relationship between quantitative variables is to plot them against each other in a scatter plot. This makes it easy to look for patterns or trends in the data. Let’s start by plotting the area of a rental against its monthly price to see if we can spot any patterns.

In [None]:
plt.scatter(x = housing.price, y = housing.sqfeet)
plt.xlabel('Rental Price (USD)')
plt.ylabel('Area (Square Feet)')
plt.show()

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/price_v_area.svg)

While there’s a lot of variation in the data, it seems like more expensive housing tends to come with slightly more space. This suggests an association between these two variables.

It’s important to note that different kinds of associations can lead to different patterns in a scatter plot. For example, the following plot shows the relationship between the age of a child in months and their weight in pounds. We can see that older children tend to weigh more but that the growth rate starts leveling off after 36 months:

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/weightvage.svg)

If we don’t see any patterns in a scatter plot, we can probably guess that the variables are not associated. For example, a scatter plot like this would suggest no association:

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/no_association.svg)

## Instructions
##### 1. The housing data has been saved for you as a dataframe named housing in script.py. Create a scatter plot to see if there is an association between the area (sqfeet) of a rental and the number of bedrooms (beds). Do you think these variables are associated? If so, is the relationship what you expected?

`Hint` <br>
`Fill in the following code:`

In [None]:
plt.scatter(x = housing.beds, y = housing.___)
plt.xlabel('Number of beds')
plt.ylabel('Area (Square Feet)')
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3

housing = pd.read_csv('housing_sample.csv')

print(housing.head())

#create your scatter plot here:
plt.scatter(housing.beds, housing.sqfeet)
plt.xlabel('Number of Beds')
plt.ylabel('Area (Square Feet)')
plt.show()

# Exploring Covariance
Beyond visualizing relationships, we can also use summary statistics to quantify the strength of certain associations. Covariance is a summary statistic that describes the strength of a linear relationship. A linear relationship is one where a straight line would best describe the pattern of points in a scatter plot.

Covariance can range from negative infinity to positive infinity. A positive covariance indicates that a larger value of one variable is associated with a larger value of the other. A negative covariance indicates a larger value of one variable is associated with a smaller value of the other. A covariance of 0 indicates no linear relationship. Here are some examples:

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/covariance_fig2.svg)
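To make the sign of covariance concrete, here is a minimal sketch that computes a sample covariance by hand. The helper name `sample_covariance` and the toy lists are made up for illustration and are not part of the housing data. The sketch divides by n - 1, which matches the default behavior of NumPy's cov() function.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Made-up values for illustration only
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # larger hours go with larger values
falling = [10, 8, 6, 4, 2]   # larger hours go with smaller values

cov_rising = sample_covariance(hours, rising)
cov_falling = sample_covariance(hours, falling)

print(cov_rising)   # positive covariance
print(cov_falling)  # negative covariance
```

Each term in the sum is positive when both values sit on the same side of their means and negative when they sit on opposite sides, which is why the sign of the total reflects the direction of the relationship.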

To calculate covariance, we can use the cov() function from NumPy, which produces a covariance matrix for two or more variables. A covariance matrix for two variables looks something like this:

| |variable 1|variable 2|
|---|---|---|
|variable 1|variance(variable 1)|covariance|
|variable 2|covariance|variance(variable 2)|

In Python, we can calculate this matrix as follows:

In [None]:
import numpy as np

cov_mat_price_sqfeet = np.cov(housing.price, housing.sqfeet)
print(cov_mat_price_sqfeet)
# output:
# [[184332.9  57336.2]
#  [ 57336.2 122045.2]]

Notice that the covariance appears twice in this matrix and is equal to 57336.2.
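Rather than copying the number by eye, the covariance can also be read out of the matrix by position. The following is a small sketch on made-up arrays (the names `x`, `y`, and `cov_xy` are illustrative, not from the lesson's dataset) showing that the two off-diagonal entries match and that the diagonal holds each variable's sample variance.

```python
import numpy as np

# Made-up values for illustration only (not the housing data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_mat = np.cov(x, y)

# Row 0, column 1 and row 1, column 0 hold the same covariance
cov_xy = cov_mat[0, 1]

# The diagonal holds the sample variance of each variable
var_x = cov_mat[0, 0]
var_y = cov_mat[1, 1]

print(cov_mat)
print(cov_xy)
```

Indexing the matrix keeps the stored value in sync with the data if the data changes.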

## Instructions
##### 1. Use the cov() function from NumPy to calculate the covariance matrix for the sqfeet variable and the beds variable. Save the covariance matrix as cov_mat_sqfeet_beds and print it out.


`Hint` <br>
`Use the np.cov() function to calculate the covariance. Pass in the dataframe columns, housing.sqfeet and housing.beds. Save the result as cov_mat_sqfeet_beds.`

##### 2. Print out the value stored in the variable cov_mat_sqfeet_beds.


`Hint` <br>
`Use print() to print the value stored in the variable.`

##### 3. Look at the covariance matrix you just printed and find the covariance of sqfeet and beds. Save that number as a variable named cov_sqfeet_beds.


`Hint` <br>
`Remember that the covariance matrix can be read as follows:`

| |variable 1|variable 2|
|---|---|---|
|variable 1|variance(variable 1)|covariance|
|variable 2|covariance|variance(variable 2)|

In [None]:
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True, precision = 1) 

housing = pd.read_csv('housing_sample.csv')

# calculate and print covariance matrix:
cov_mat_sqfeet_beds = np.cov(housing.sqfeet, housing.beds)
print(cov_mat_sqfeet_beds)

# store the covariance as cov_sqfeet_beds (the off-diagonal entry, about 228.2)
cov_sqfeet_beds = cov_mat_sqfeet_beds[0][1]

# Correlation Part 1


Pearson correlation (often referred to simply as “correlation”) is a scaled form of covariance. Like covariance, it measures the strength of a linear relationship, but it ranges from -1 to +1, which makes it easier to interpret.


Highly associated variables with a positive linear relationship will have a correlation close to 1. Highly associated variables with a negative linear relationship will have a correlation close to -1. Variables that do not have a linear association (or a linear association with a slope of zero) will have correlations close to 0.

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/correlation_fig_1-3.svg)
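To illustrate what "scaled form of covariance" means, here is a minimal sketch that divides the covariance by the product of the two sample standard deviations. The helper names and the toy numbers are assumptions made for this example and are not taken from the housing dataset. The sketch also shows why correlation is easier to interpret: changing the unit of one variable changes the covariance but not the correlation.

```python
import math

def sample_covariance(x, y):
    """Sample covariance, dividing by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Made-up values for illustration only
area_sqft = [500, 700, 900, 1100, 1500]
price_usd = [800, 950, 1000, 1300, 1600]
price_cents = [p * 100 for p in price_usd]  # same prices in a different unit

cov_usd = sample_covariance(area_sqft, price_usd)
cov_cents = sample_covariance(area_sqft, price_cents)
corr_usd = pearson_correlation(area_sqft, price_usd)
corr_cents = pearson_correlation(area_sqft, price_cents)

print(cov_usd, cov_cents)    # covariance scales with the unit change
print(corr_usd, corr_cents)  # correlation stays the same
```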

The pearsonr() function from scipy.stats can be used to calculate correlation as follows:

In [None]:
from scipy.stats import pearsonr
corr_price_sqfeet, p = pearsonr(housing.price, housing.sqfeet)
print(corr_price_sqfeet) #output: 0.507

Generally, a correlation with a magnitude larger than about 0.3 indicates a linear association, and a magnitude greater than about 0.6 suggests a strong linear association.
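As a rough sketch of how these rules of thumb could be applied, the hypothetical helper below labels a correlation value. The function name and the labels "moderate" and "weak or none" are assumptions made for this example. Comparing the absolute value is also an assumption, so that negative correlations are treated the same way as positive ones.

```python
def describe_linear_association(r):
    """Label a correlation using the approximate 0.3 and 0.6 cutoffs."""
    strength = abs(r)  # the sign gives direction, the magnitude gives strength
    if strength > 0.6:
        return "strong"
    if strength > 0.3:
        return "moderate"
    return "weak or none"

print(describe_linear_association(0.507))  # the price and sqfeet correlation
print(describe_linear_association(-0.75))
print(describe_linear_association(0.1))
```

These cutoffs are guidelines rather than strict thresholds, so a label like this is best read alongside a scatter plot.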

## Instructions
##### 1. Use the pearsonr function from scipy.stats to calculate the correlation between sqfeet and beds. Store the result in a variable named corr_sqfeet_beds and print out the result. How strong is the linear association between these variables?


`Hint` <br>
`Fill in the following code`

In [None]:
corr_sqfeet_beds, p = pearsonr(housing.sqfeet, ___)
print(corr_sqfeet_beds)

##### 2. Generate a scatter plot of sqfeet and beds again. Does the correlation value make sense?


`Hint` <br>
`Fill in the following code:`

In [None]:
plt.xlabel('Number of beds')
plt.ylabel('Area (Square Feet)')
plt.scatter(x = housing.beds, y = housing.___)
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3
from scipy.stats import pearsonr

housing = pd.read_csv('housing_sample.csv')

# calculate corr_sqfeet_beds and print it out:
corr_sqfeet_beds, p = pearsonr(housing.sqfeet, housing.beds)
print(corr_sqfeet_beds)

# create the scatter plot here:
plt.scatter(housing.beds, housing.sqfeet)
plt.xlabel('Number of Beds')
plt.ylabel('Area (Square Feet)')
plt.show()

# Correlation Part 2
It’s important to note that there are some limitations to using correlation or covariance as a way of assessing whether there is an association between two variables. Because correlation and covariance both measure the strength of linear relationships with non-zero slopes, but not other kinds of relationships, correlation can be misleading.

For example, the four scatter plots below all show pairs of variables with near-zero correlations. The bottom left image shows an example of a perfect linear association where the slope is zero (the line is horizontal). Meanwhile, the other three plots show non-linear relationships — if we drew a line through any of these sets of points, that line would need to be curved, not straight!

![](https://static-assets.codecademy.com/Courses/Hypothesis-Testing/correlation_fig_2.svg)
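A small made-up example can show how a clear relationship still produces a correlation of zero. In the sketch below (the helper and the data are illustrative assumptions, not the lesson's datasets), `y_curve` is completely determined by `x`, yet because the pattern is a symmetric curve rather than a line, the correlation comes out to zero. A straight-line relationship built on the same `x` values gives a correlation of 1.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation computed from sums of deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up values for illustration only
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [a ** 2 for a in x]     # U-shaped: a clear but non-linear pattern
y_line = [2 * a + 1 for a in x]   # straight line with a positive slope

corr_curve = pearson_correlation(x, y_curve)
corr_line = pearson_correlation(x, y_line)

print(corr_curve)
print(corr_line)
```

This is why it helps to look at a scatter plot before relying on a correlation value alone.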

## Instructions
##### 1. A simulated dataset named sleep has been loaded for you in script.py. The hypothetical data contains two columns:

- hours_sleep: the number of hours that a person slept
- performance: that person’s performance score on a physical task the next day

Create a scatter plot of hours_sleep (on the x-axis) and performance (on the y-axis). What is the relationship between these variables?


`Hint` <br>
`Fill in the following code:`

In [None]:
plt.scatter(sleep.___, sleep.___)

`More sleep appears to be associated with higher performance, up to about eight hours, after which more sleep is associated with poorer performance.`

##### 2. Calculate the correlation for hours_sleep and performance and save the result as corr_sleep_performance. Then, print out corr_sleep_performance. Does the correlation accurately reflect the strength of the relationship between these variables?


`Hint` <br>
`The correlation is only 0.28 (a relatively small correlation), even though the variables seem to be clearly associated (there is a very clear pattern in the scatter plot).`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import codecademylib3
from scipy.stats import pearsonr

sleep = pd.read_csv('sleep_performance.csv')

# create your scatter plot here:
plt.scatter(sleep.hours_sleep, sleep.performance)
plt.xlabel('Hours of Sleep')
plt.ylabel('Performance Score')
plt.show()

# calculate the correlation for `hours_sleep` and `performance`:
corr_sleep_performance, p = pearsonr(sleep.hours_sleep, sleep.performance)
print(corr_sleep_performance)

# Review
In this lesson we discussed several ways of examining an association between two quantitative variables. More specifically, we:

- Used scatter plots to examine relationships between quantitative variables
- Used covariance and correlation to quantify the strength of a linear relationship between two quantitative variables

Note that the dataset used in this lesson was downloaded from Kaggle.

## Instructions
As a final exercise, a new dataset named penguins has been uploaded for you in script.py. This dataset contains various measurements for a sample of penguins. To practice the skills learned in this lesson, here are some things to try:

- Inspect the first few rows of data.
- Create a scatter plot of flipper length (flipper_length_mm) and body mass (body_mass_g).
- Inspect your plot. What is the relationship between these variables?
- Calculate the covariance for these two variables.
- Calculate the correlation for these two variables. Does this number make sense given the plot you created?

Solution code is available to you in solution.py if you want to compare your work.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import codecademylib3
from scipy.stats import pearsonr
np.set_printoptions(suppress=True, precision = 1) 

penguins = pd.read_csv('penguins.csv')

#print the first few rows
print(penguins.head())

# create a scatter plot
plt.scatter(penguins.flipper_length_mm, penguins.body_mass_g)
plt.xlabel('Flipper Length (mm)')
plt.ylabel('Body Mass (g)')
plt.show()

#calculate covariance:
covariance_mat = np.cov(penguins.flipper_length_mm, penguins.body_mass_g)
print("covariance matrix: ")
print(covariance_mat)

print("covariance: ", covariance_mat[1][0])

#calculate correlation:
correlation, p = pearsonr(penguins.flipper_length_mm, penguins.body_mass_g)
print("correlation: ", correlation)