
# Lab 8 Validation Notebook

Do not write any code in this notebook. To work on the lab, open `lab.py` and `lab.ipynb`.

The only purpose of this notebook is to give you a blank copy of `lab.ipynb`,
so that you can go to **Kernel > Restart & Run All** and ensure that all public `grader.check` cells pass using just the code in your `lab.py`.

**Before submitting Lab 8, make sure that the call to `grader.check_all()` at the bottom of this notebook shows that all test cases passed!** 
If it doesn't, a function you wrote in `lab.ipynb` is likely missing from `lab.py` or implemented incorrectly there, or a function in `lab.py` depends on an object (e.g. a DataFrame) that is not passed to it as an argument.
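
The last case above is easy to miss: such a function works in `lab.ipynb`, where the object happens to exist, but fails when it is called from a fresh kernel. Here is a minimal sketch of the difference. The names `prices_df`, `mean_price_fragile`, and `mean_price_robust` are hypothetical and are not part of the lab.

```python
import pandas as pd

# Hypothetical example; none of these names come from lab.py.

def mean_price_fragile():
    # Relies on a global variable `prices_df` that only exists if some
    # notebook cell happened to define it before this call.
    return prices_df['price'].mean()

def mean_price_robust(df):
    # Everything the function needs is passed in as an argument.
    return df['price'].mean()

toy = pd.DataFrame({'price': [100, 200, 300]})
robust_result = mean_price_robust(toy)

# In a fresh kernel, `prices_df` was never defined, so the fragile version fails.
try:
    mean_price_fragile()
    fragile_failed = False
except NameError:
    fragile_failed = True

print(robust_result, fragile_failed)
```

Running **Kernel > Restart & Run All** in this notebook surfaces exactly this kind of hidden dependency, because only the code in `lab.py` is available.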
    

In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from lab import *

In [4]:
import pandas as pd
import numpy as np
import plotly.express as px
import statsmodels.api as sm
from pathlib import Path
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer

import warnings
warnings.filterwarnings('ignore')

In [5]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell.
np.random.seed(23)

# Generates a random scatter plot
x = np.arange(1, 101) + np.random.normal(0, 0.5, 100)
y = 2 * ((x + np.random.normal(0, 1, 100)) ** 2) + np.abs(x) * np.random.normal(0, 30, 100)
df_1 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_1, x='x', y='y', trendline="ols", trendline_color_override="red")

In [6]:
df_1['root y'] = np.sqrt(df_1['y'])

px.scatter(df_1, x='x', y='root y', trendline="ols", trendline_color_override="red")

In [7]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell
np.random.seed(32)

# Generates a different random scatter plot
x = np.linspace(2, 5, 100)
y = 10 * np.exp(x) + np.abs(x) * np.random.normal(0, 5, 100) + np.random.normal(0, 30, 100)
df_2 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_2, x='x', y='y', trendline="ols", trendline_color_override="red")

In [8]:
df_2['root y'] = np.sqrt(df_2['y'])

px.scatter(df_2, x='x', y='root y', trendline="ols", trendline_color_override="red")

In [9]:
# Feel free to use this function directly to help you answer Question 1.
from sklearn.linear_model import LinearRegression

def create_residual_plot(df, x, y):
    """Fit a simple linear regression of df[y] on df[x] and plot residuals against predictions."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    model = LinearRegression()
    model.fit(df[[x]], df[y])
    df['pred'] = model.predict(df[[x]])
    df[f'{y} residuals'] = df[y] - df['pred']
    return px.scatter(df, x='pred', y=f'{y} residuals', trendline='ols', trendline_color_override='red')

create_residual_plot(df_2, 'x', 'root y')

In [10]:
df_2['log y'] = np.log(df_2['y'])

px.scatter(df_2, x='x', y='log y', trendline="ols", trendline_color_override="red")

In [11]:
create_residual_plot(df_2, 'x', 'log y')
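
The visual comparison of the square-root and log transformations above can also be checked numerically. The sketch below uses a noise-free stand-in for `df_2` (an assumption; the real data adds random noise) and a hypothetical helper, `linear_fit_r2`, to compare how well a straight line fits each version of `y`.

```python
import numpy as np

def linear_fit_r2(x, y):
    """R-squared of the least-squares line of y on x (hypothetical helper)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Noise-free stand-in for df_2: an exact exponential relationship.
x = np.linspace(2, 5, 100)
y = 10 * np.exp(x)

r2_raw = linear_fit_r2(x, y)            # line fit to y itself
r2_sqrt = linear_fit_r2(x, np.sqrt(y))  # line fit to sqrt(y)
r2_log = linear_fit_r2(x, np.log(y))    # line fit to log(y)

print(f"raw: {r2_raw:.4f}, sqrt: {r2_sqrt:.4f}, log: {r2_log:.4f}")
```

For an exponential relationship, only the log transformation yields an exactly linear relationship. The square root reduces the curvature but does not remove it, so a pattern remains in its residual plot.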

In [12]:
homeruns_fp = Path('data')/'homeruns.csv'
homeruns = pd.read_csv(homeruns_fp)

In [13]:
grader.check("q1")

In [14]:
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
diamonds.head()

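| Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |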


In [15]:
# don't change this cell, but do run it -- it is needed for the tests to work
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
out_q2 = create_ordinal(diamonds)

In [16]:
grader.check("q2")

In [17]:
# don't change this cell, but do run it -- it is needed for the tests to work
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
out1_q3 = create_one_hot(diamonds)
out2_q3 = create_proportions(diamonds)

In [18]:
grader.check("q3")

In [19]:
# don't change this cell, but do run it -- it is needed for the tests to work
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
out_q4 = create_quadratics(diamonds)

In [20]:
grader.check("q4")

In [21]:
from sklearn.linear_model import LinearRegression

# X = ...
# y = ...

# lr = LinearRegression()
# lr.fit(X, y)  # X is a DataFrame of training data; y is a Series of prices
# lr.score(X, y)  # R-squared
# lr.predict(X) # predicted prices
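
As a concrete illustration of the commented template above, here is a minimal sketch on toy data. The DataFrame `toy` and its values are made up for illustration and are not the diamonds data; the price is constructed to be an exact linear function of carat, so the fit is perfect.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): price is exactly 1000 * carat + 50.
toy = pd.DataFrame({'carat': [0.2, 0.5, 1.0, 1.5, 2.0]})
toy['price'] = 1000 * toy['carat'] + 50

X = toy[['carat']]  # double brackets keep X two-dimensional (a DataFrame)
y = toy['price']    # a Series of target values

lr = LinearRegression()
lr.fit(X, y)
r2 = lr.score(X, y)    # R-squared on the training data
preds = lr.predict(X)  # predicted prices, as a NumPy array

print(r2, lr.coef_[0], lr.intercept_)
```

Note that `fit` expects a two-dimensional `X`, which is why the sketch selects `toy[['carat']]` rather than `toy['carat']`.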

In [22]:
# don't change this cell, but do run it -- it is needed for the tests to work
import numbers
out_q5 = comparing_performance()

In [23]:
grader.check("q5")

In [24]:
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer
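
The three transformers imported above can be tried on a small array before they are applied to the diamonds data. This is a sketch on made-up values; it does not show how the `TransformDiamonds` methods in `lab.py` are meant to be implemented, and the threshold and function chosen here are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer

# Toy single-column feature matrix (made-up values).
X = np.array([[0.5], [1.0], [2.0], [4.0], [8.0]])

# Binarizer: values strictly greater than the threshold become 1, the rest 0.
binarized = Binarizer(threshold=1.0).fit_transform(X)

# FunctionTransformer: applies an arbitrary function elementwise, here the natural log.
logged = FunctionTransformer(np.log).fit_transform(X)

# QuantileTransformer: maps each value to its quantile in the fitted data.
# n_quantiles must not exceed the number of samples, so it is set to 5 here.
qt = QuantileTransformer(n_quantiles=5)
qt.fit(X)
quantiles = qt.transform(X)

print(binarized.ravel())
print(logged.ravel())
print(quantiles.ravel())
```

`QuantileTransformer` separates `fit` from `transform`, so it can be fit on one dataset and applied to another.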

In [25]:
# don't change this cell, but do run it -- it is needed for the tests to work
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
q6a_trans = TransformDiamonds(diamonds)
q6a_out = q6a_trans.transform_carat(diamonds)

In [26]:
grader.check("q6.1")

In [27]:
# don't change this cell, but do run it -- it is needed for the tests to work
q6b_trans = TransformDiamonds(diamonds)
q6b_out = q6b_trans.transform_to_quantile(diamonds)
q6b_trans_top_1000 = TransformDiamonds(diamonds[:1000])
q6b_out_top_1000 = q6b_trans_top_1000.transform_to_quantile(diamonds)

In [28]:
grader.check("q6.2")

In [29]:
# don't change this cell, but do run it -- it is needed for the tests to work
diamonds = pd.read_csv(Path('data')/'diamonds.csv')
q6c_trans = TransformDiamonds(diamonds)
q6c_out = q6c_trans.transform_to_depth_pct(diamonds)

In [30]:
grader.check("q6.3")

In [31]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results: All test cases passed!

q5 results: All test cases passed!

q6.1 results: All test cases passed!

q6.2 results: All test cases passed!

q6.3 results: All test cases passed!


If you were able to go to **Kernel > Restart & Run All** and see all test cases pass above, and you've thoroughly tested your code yourself, you're ready to submit `lab.py` to Gradescope!
    