In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Dataset: beer_foam.csv

Source: J.J. Hackbarth (2006). "Multivariate Analyses of Beer Foam Stand,"
Journal of the Institute of Brewing, Vol. 112, #1, pp. 17-24

Description: Measurements of wet foam height and beer height at various
time points for Shiner Bock at 20 °C. The author fits an exponential decay model:
H(t) = H(0)*exp(-lambda*t)

Variables/Columns
TIME: Time from pour (seconds)  4-8
FOAM: Wet foam height (cm)  10-16
BEER: Beer height (cm)    18-24

**Hypothesis**: Can we predict the time from pour using the measurements of foam height and beer height?


In [2]:
# Read the csv file into a pandas DataFrame

coral = pd.read_csv('coralsyearcount.csv')
coral

Unnamed: 0,year,total
0,2018,22274
1,2017,53033
2,2016,32030
3,2015,34057
4,2014,41286
...,...,...
139,1868,63
140,1866,1
141,1861,2
142,1851,1
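
The `Unnamed: 0` column in the output above usually appears when a CSV was written together with its row index and then read back without naming an index column. A minimal sketch, assuming the first column of the file is such a saved index; the inline sample is a hypothetical stand-in built from the first three rows shown above, not the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for coralsyearcount.csv: an unnamed first column
# followed by year and total, using the first three rows of the output above.
sample_csv = io.StringIO(",year,total\n0,2018,22274\n1,2017,53033\n2,2016,32030\n")

# index_col=0 uses the first column as the row index instead of
# loading it as a data column named "Unnamed: 0".
coral_sample = pd.read_csv(sample_csv, index_col=0)
print(coral_sample.columns.tolist())  # ['year', 'total']
```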


In [3]:
# Assign the data to X and y
# Double brackets keep X two-dimensional (n_samples, 1), which scikit-learn expects

X = coral[["year"]]
y = coral["total"].values.reshape(-1, 1)
print(X.shape, y.shape)

(144, 1) (144, 1)


In [4]:
# Use train_test_split to create training and testing data

### BEGIN SOLUTION
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### END SOLUTION
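
When `test_size` is not given, `train_test_split` holds out 25% of the rows for testing, which for 144 rows gives 108 training and 36 testing samples. A rough NumPy-only sketch of that kind of shuffled split, for illustration; `shuffled_split` is a hypothetical helper and not scikit-learn's implementation, so it will not reproduce the same row assignment:

```python
import numpy as np

def shuffled_split(X, y, test_fraction=0.25, seed=42):
    """Hypothetical helper: shuffle row indices, then hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in data with the same number of rows as the DataFrame above.
X_demo = np.arange(144).reshape(-1, 1)
y_demo = (2 * np.arange(144)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (108, 1) (36, 1)
```

Fixing the seed, as `random_state=42` does in the cell above, makes the split repeatable between runs.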

In [5]:
# Create the model using LinearRegression

### BEGIN SOLUTION
from sklearn.linear_model import LinearRegression
model = LinearRegression()
### END SOLUTION
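
`LinearRegression` fits an ordinary least squares line, and its `score` method reports the coefficient of determination, R² = 1 - SS_res / SS_tot. A small NumPy sketch of both steps on synthetic data (the numbers are made up for illustration and are not from the coral dataset; `r_squared` is a hypothetical helper):

```python
import numpy as np

# Synthetic, exactly linear data: y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_lin = 3.0 * x + 2.0

# Least-squares fit with an intercept: design matrix columns are [x, 1].
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y_lin, rcond=None)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_hat = slope * x + intercept
score = r_squared(y_lin, y_hat)
print(slope, intercept, score)
```

A score close to 1 means the line explains most of the variance. On held-out data the score can be negative, which means the model predicts worse than simply using the mean of the targets.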

In [6]:
# Fit the model to the training data and calculate the scores for the training and testing data

### BEGIN SOLUTION
model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)
testing_score = model.score(X_test, y_test)

### END SOLUTION 

print(f"Training Score: {training_score}")
print(f"Testing Score: {testing_score}")

ValueError: Expected 2D array, got 1D array instead:
array=[1898 1976 1890 1881 1914 1958 2018 1931 1990 1978 1927 1994 1993 1995
 1894 1951 1974 2003 1915 1979 1996 1950 1935 1866 1971 1988 1897 1939
 1871 1985 1934 1851 1956 1868 1873 1965 2013 1929 1900 1969 1983 1938
 1984 1877 2011 1975 1948 1924 1932 1885 1879 2010 2005 1901 1895 2015
 2001 1980 1946 1888 2012 1908 2016 1886 1964 1968 1905 1972 1904 1957
 1861 1936 1872 1922 1977 1960 1923 1970 1925 1882 1997 1961 1940 1986
 1869 1959 1955 1880 1981 1989 1902 1874 2017 1966 1878 1906 1911 1893
 1926 1941 1887 1842 1998 1947 1903 2004 1920 1907].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
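
The traceback above is what scikit-learn raises when the feature input is one-dimensional, for example a single pandas column selected with single brackets (`coral["year"]`) rather than double brackets (`coral[["year"]]`). A minimal NumPy sketch of the reshape the message suggests, using the first three years from the printed array:

```python
import numpy as np

years_1d = np.array([1898, 1976, 1890])  # shape (3,): one value per sample, no feature axis
years_2d = years_1d.reshape(-1, 1)       # shape (3, 1): 3 samples, 1 feature

print(years_1d.shape, years_2d.shape)  # (3,) (3, 1)
```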

In [None]:
# Plot the Residuals for the Training and Testing data

### BEGIN SOLUTION
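# Compute the predictions once and reuse them for both axes
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

plt.scatter(train_pred, train_pred - y_train, c="blue", label="Training Data")
plt.scatter(test_pred, test_pred - y_test, c="orange", label="Testing Data")
plt.legend()
# The x-axis shows predictions, so draw the zero line across the predicted range
plt.hlines(y=0, xmin=min(train_pred.min(), test_pred.min()), xmax=max(train_pred.max(), test_pred.max()))
plt.xlabel("Predicted value")
plt.ylabel("Residual (predicted - actual)")
plt.title("Residual Plot")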
### END SOLUTION