# 1. Drawing Lines
Instructions
Plot the equation y = x − 1, using the existing x variable.
Plot the equation y = x + 10, using the existing x variable.
Hint
Remember to adjust the x variable and assign the result to y.

In [1]:
import matplotlib.pyplot as plt
import numpy as np

x = [0, 1, 2, 3, 4, 5]
# Going by our formula, every y value at a position is the same as the x-value in the same position.
# We could write y = x, but let's write them all out to make this more clear.
y = [0, 1, 2, 3, 4, 5]

# As you can see, this is a straight line that passes through the points (0,0), (1,1), (2,2), and so on.
plt.plot(x, y)
plt.show()

# Let's try a slightly more ambitious line.
# What if we did y = x + 1?
# We'll make x an array now, so we can add 1 to every element more easily.
x = np.asarray([0, 1, 2, 3, 4, 5])
y = x + 1

# y is the same as x, but every element has 1 added to it.
print(y)

# This plot passes through (0,1), (1,2), and so on.
# It's the same line as before, but shifted up 1 on the y-axis.
plt.plot(x, y)
plt.show()

# By adding 1 to the line, we moved what's called the y-intercept -- where the line intersects with the y-axis.
# Moving the intercept can shift the whole line up (or down when we subtract).

y = x - 1 

plt.plot(x, y)
plt.show()

y = x + 10 

plt.plot(x, y)
plt.show()

[1 2 3 4 5 6]
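To make the intercept idea concrete, here is a minimal sketch that checks the shift numerically instead of reading it off a plot. The helper `shifted_line` is a hypothetical name introduced only for this illustration; it is not part of the lesson.

```python
import numpy as np

x = np.asarray([0, 1, 2, 3, 4, 5])

# Hypothetical helper (not from the lesson): evaluate y = x + b for an intercept b.
def shifted_line(x, b):
    return x + b

for b in (-1, 1, 10):
    y = shifted_line(x, b)
    # The first element of x is 0, so y[0] is where the line crosses the y-axis.
    # y - x is the same constant everywhere: the whole line moved by b.
    print(b, y[0], y - x)
```

Every point moves by the same amount, which is why changing the intercept shifts the line without tilting it.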


# 2. Working With Slope 
Instructions
Plot the equation y = 4x, using the existing x variable.
Plot the equation y = 0.5x, using the existing x variable.
Plot the equation y = −2x, using the existing x variable.
Hint
Remember to multiply x by the right value to get y.

In [7]:
import matplotlib.pyplot as plt
import numpy as np

x = np.asarray([0, 1, 2, 3, 4, 5])

y = 2*x

plt.plot(x, y)
plt.show()

y = 4*x

plt.plot(x, y)
plt.show()

y = .5*x

plt.plot(x, y)
plt.show()

y = -2*x

plt.plot(x, y)
plt.show()
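As a rough check on what the multiplier does, the sketch below computes "rise over run" between consecutive points for each of the lines above. It relies only on `np.diff`, and the loop variable `m` is just an illustrative name for the slope.

```python
import numpy as np

x = np.asarray([0, 1, 2, 3, 4, 5])

for m in (2, 4, 0.5, -2):
    y = m * x
    # Change in y divided by change in x between neighbouring points.
    # For a straight line this ratio is the same everywhere and equals m.
    rise_over_run = np.diff(y) / np.diff(x)
    print(m, rise_over_run)
```

A negative multiplier gives a negative ratio, which is the line sloping downward as x increases.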

# 3. Starting Out With Linear Regression 
Instructions
Calculate the slope you would need to predict the "quality" column (y) using the "density" column (x).
Assign the slope to slope_density.
Hint
Remember to compute the covariance with cov(x,y), and the variance with .var().
Remember that the cov function returns a matrix, and you have to index with [0, 1].

In [3]:
import pandas as pd

wine_quality = pd.read_table('winequality-red.csv',sep=';')
wine_quality.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5


In [8]:
from numpy import cov
import matplotlib.pyplot as plt

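# The slope is cov(x, y) / var(x), so divide by the variance of the
# x column ("density"), not the y column ("quality").
slope_density = cov(wine_quality["density"], wine_quality["quality"])[0, 1] / wine_quality["density"].var()
slope_density

-74.8460136015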
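The slope formula, covariance of x and y divided by the variance of x, can be sanity-checked on a tiny dataset. The sketch below uses made-up numbers (not the wine data) and compares the result with `np.polyfit`, which fits a least-squares line independently. One detail worth noting: `np.cov` and the pandas `.var()` method both default to `ddof=1`, while NumPy's own `.var()` defaults to `ddof=0`, so the NumPy version has to pass `ddof=1` explicitly.

```python
import numpy as np

# Made-up data, roughly y = 2x + 1 with a little noise; not the wine dataset.
x = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.asarray([3.1, 4.9, 7.2, 8.8, 11.0])

# np.cov returns a 2x2 matrix; element [0, 1] is the covariance of x and y.
# Use ddof=1 for the variance so it matches np.cov's default normalisation.
slope = np.cov(x, y)[0, 1] / x.var(ddof=1)

# Independent cross-check: a degree-1 least-squares fit.
fit_slope, fit_intercept = np.polyfit(x, y, 1)
print(slope, fit_slope)
```

The two slopes should agree up to floating-point rounding.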

# 5. Finishing Linear Regression
Instructions
Calculate the y-intercept that you would need to predict the "quality" column (y) using the "density" column (x).
Assign the result to intercept_density.
Hint
wine_quality["quality"].mean() gives you the mean of the "quality" column.

In [10]:
from numpy import cov

def calc_slope(x, y):
    return cov(x, y)[0, 1] / x.var()
intercept_density = wine_quality["quality"].mean() - (calc_slope(wine_quality["density"], wine_quality["quality"]) * wine_quality["density"].mean())
intercept_density

80.2385380207906
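The intercept formula, mean of y minus slope times mean of x, is equivalent to forcing the fitted line through the point (mean of x, mean of y). A minimal sketch on made-up data (again, not the wine dataset) shows this:

```python
import numpy as np

# Made-up data, roughly y = 2x + 1 with a little noise; not the wine dataset.
x = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.asarray([3.1, 4.9, 7.2, 8.8, 11.0])

slope = np.cov(x, y)[0, 1] / x.var(ddof=1)
intercept = y.mean() - slope * x.mean()

# Plugging the mean of x into the line gives back the mean of y.
print(slope * x.mean() + intercept, y.mean())
```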

# 6. Making Predictions
Instructions
Write a function to compute the predicted y-value from a given x-value.
Use the .apply() method on the "density" column to apply the function to each item in the column. This will compute all the predicted y-values.
Assign the result to predicted_quality.
Hint
Compute the slope and intercept with the given functions.
Make a function that takes one number as an argument, and computes the predicted y value.
Use the .apply() method on the "density" column to compute the ratings.
    

In [13]:
from numpy import cov

def calc_slope(x, y):
    return cov(x, y)[0, 1] / x.var()

def calc_intercept(x, y, slope):
    return y.mean() - (slope * x.mean())
slope = calc_slope(wine_quality["density"], wine_quality["quality"])
intercept = calc_intercept(wine_quality["density"], wine_quality["quality"], slope)

def compute_predicted_y(x):
    return x * slope + intercept

predicted_quality = wine_quality["density"].apply(compute_predicted_y)
predicted_quality.head(3)

0    5.557186
1    5.632032
2    5.617062
Name: density, dtype: float64
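The exercise asks for `.apply()`, but arithmetic on a pandas Series is already element-wise, so the same predictions can be computed without a helper function. The sketch below compares the two approaches; the slope and intercept are rounded from the outputs in this notebook and the three densities are taken from the first rows shown earlier, so treat them as stand-in values rather than the full calculation.

```python
import pandas as pd

# Stand-in values, rounded from outputs shown in this notebook.
slope, intercept = -74.846, 80.2385
density = pd.Series([0.9978, 0.9968, 0.9970], name="density")

def compute_predicted_y(x):
    return x * slope + intercept

# Approach 1: call the function once per element.
by_apply = density.apply(compute_predicted_y)

# Approach 2: operate on the whole Series at once.
vectorized = density * slope + intercept

# Largest absolute difference between the two results.
print((by_apply - vectorized).abs().max())
```

On large columns the vectorized form is usually faster, since it avoids a Python-level function call per row.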

# 7. Finding Error
Instructions
Using the given slope and intercept, calculate the predicted y values.
Subtract each predicted y value from the corresponding actual y value, square the difference, and add all the squared differences together.
This will give you the sum of squared residuals. Assign this value to rss.
Hint
You can compute the predicted y values by plugging the slope and intercept into the linear regression equation and using a list comprehension to go over all the values in the "density" column.
You can then compute the squared residuals by subtracting each predicted y value from the corresponding actual y value, and squaring the result.
Add these squared residuals to get the sum.


In [14]:
from scipy.stats import linregress

slope, intercept, r_value, p_value, stderr_slope = linregress(wine_quality["density"], wine_quality["quality"])

print(slope)
print(intercept)

import numpy
predicted_y = numpy.asarray([slope * x + intercept for x in wine_quality["density"]])
residuals = (wine_quality["quality"] - predicted_y) ** 2
rss = sum(residuals)

-74.8460136015
80.2385380208
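The sum of squared residuals can also be written with array arithmetic instead of a list comprehension. The sketch below uses made-up data and `np.polyfit` for the fit (an assumption made to keep the example self-contained; the cell above uses `linregress`). It also compares the result against a deliberately worse line, since the least-squares fit should give the smallest RSS.

```python
import numpy as np

# Made-up data, roughly y = 2x + 1 with a little noise; not the wine dataset.
x = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.asarray([3.1, 4.9, 7.2, 8.8, 11.0])

slope, intercept = np.polyfit(x, y, 1)

# Residual = actual minus predicted, one per data point.
predicted_y = slope * x + intercept
residuals = y - predicted_y
rss = np.sum(residuals ** 2)

# A line with a slightly different slope should fit worse (larger RSS).
rss_other = np.sum((y - ((slope + 0.1) * x + intercept)) ** 2)
print(rss, rss_other)
```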


# 8. Standard Error
Instructions
Calculate the standard error using the above formula.
Calculate what percentage of actual y values are within 1 standard error of the predicted y value. Assign the result to within_one.
Calculate what percentage of actual y values are within 2 standard errors of the predicted y value. Assign the result to within_two.
Calculate what percentage of actual y values are within 3 standard errors of the predicted y value. Assign the result to within_three.
Assume that "within" means "up to and including", so be sure to count values that are exactly 1, 2, or 3 standard errors away.
Hint
The standard error can be calculated with (rss / (len(wine_quality["quality"]) - 2)) ** .5.
To find the percentage of actual y values within 1 standard error of the predicted y value, first take the difference between each actual y and each predicted y value.
Then, take the absolute value of the differences and divide by the standard error. Any result that is less than or equal to the number of standard errors you want (1, 2, or 3) falls within that range.

In [15]:
from scipy.stats import linregress

slope, intercept, r_value, p_value, stderr_slope = linregress(wine_quality["density"], wine_quality["quality"])

print(slope)
print(intercept)

import numpy
predicted_y = numpy.asarray([slope * x + intercept for x in wine_quality["density"]])
residuals = (wine_quality["quality"] - predicted_y) ** 2
rss = sum(residuals)

stderr = (rss / (len(wine_quality["quality"]) - 2)) ** .5

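def within_percentage(y, predicted_y, stderr, error_count):
    # Largest absolute error that still counts as "within" (inclusive).
    within = stderr * error_count

    # Absolute distance between each actual and predicted value.
    differences = abs(predicted_y - y)
    lower_differences = [d for d in differences if d <= within]

    # Returned as a fraction between 0 and 1; multiply by 100 for a percentage.
    within_count = len(lower_differences)
    return within_count / len(y)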

within_one = within_percentage(wine_quality["quality"], predicted_y, stderr, 1)
within_two = within_percentage(wine_quality["quality"], predicted_y, stderr, 2)
within_three = within_percentage(wine_quality["quality"], predicted_y, stderr, 3)

-74.8460136015
80.2385380208
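For comparison, the same "within n standard errors" count can be sketched with a boolean mask. The function name `within_fraction` and the data are made up for this illustration, and `np.polyfit` stands in for `linregress` to keep the example self-contained.

```python
import numpy as np

def within_fraction(y, predicted_y, stderr, error_count):
    # True where |actual - predicted| is at most error_count standard errors.
    mask = np.abs(np.asarray(y) - np.asarray(predicted_y)) <= stderr * error_count
    # The mean of a boolean array is the fraction of True values.
    return mask.mean()

# Made-up data, roughly y = 2x + 1 with a little noise; not the wine dataset.
x = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.asarray([3.1, 4.9, 7.2, 8.8, 11.0])

slope, intercept = np.polyfit(x, y, 1)
predicted_y = slope * x + intercept

# Standard error: square root of RSS divided by (n - 2).
rss = np.sum((y - predicted_y) ** 2)
stderr = (rss / (len(y) - 2)) ** 0.5

for n in (1, 2, 3):
    print(n, within_fraction(y, predicted_y, stderr, n))
```

Like the function above, this returns a fraction between 0 and 1 rather than a value out of 100.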
