# Data Science - Unit 1 Sprint 3 Module 4

## Module Project: Metrics, Bias and Variance

### Learning Objectives

* Interpret your model results using OLS and Sklearn metrics
* Define and analyze bias in your model
* Define and analyze variance in your model

## Analyzing results from diamonds

Use the seaborn dataset `diamonds` to run a linear regression model and produce the common metrics you would use to evaluate your model's accuracy.

**Task 1** - Load the data

Load the `diamonds` dataset from the `seaborn` package.

- Assign the value to an object called `dia`
- Make sure to import the packages you expect to use for an `ols` linear regression model.

In [None]:
#Task 1

#imports
from statsmodels.formula.api import ols
import seaborn as sns
from sklearn import metrics
import numpy as np
import pandas as pd
from scipy import stats
# YOUR CODE HERE

sns.get_dataset_names()
dia = sns.load_dataset('diamonds')
dia.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [None]:
# Task 1 - Tests

assert isinstance(dia, pd.DataFrame)


**Task 2** - Conduct EDA on your dataset

- Check for null values. Assign the total number of null values in your dataset to `num_null`.

In [None]:
#Task 2
# YOUR CODE HERE
num_null = dia.isnull().sum().sum()
sns.displot(dia['price'])

In [None]:
num_null

0

**Task 3** - Visualize your feature distributions

- Use seaborn's `pairplot` to visualize the distributions of all your dataset's features.
- You can access the documentation [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html).
- This task is not autograded.

**Task 5** - Check for multicollinearity

- Compute the Pearson correlation of each of the `x`, `y`, and `z` columns with `carat`.
- Assign the correlations to `x_corr`, `y_corr`, and `z_corr` respectively.
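
To make the quantity concrete, the Pearson correlation is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand on synthetic data and compares it with `np.corrcoef`. The helper `pearson_r` and the `carat_demo` and `x_demo` arrays are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: centered cross-product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())


# Synthetic example: a length-like measurement that grows with the cube root of weight.
rng = np.random.default_rng(42)
carat_demo = rng.uniform(0.2, 2.0, size=1000)
x_demo = 6.0 * np.cbrt(carat_demo) + rng.normal(0, 0.1, size=1000)

r_manual = pearson_r(carat_demo, x_demo)
r_numpy = np.corrcoef(carat_demo, x_demo)[0, 1]
print(r_manual, r_numpy)  # the two values should agree
```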

In [None]:
dia['carat']

0        0.23
1        0.21
2        0.23
3        0.29
4        0.31
         ... 
53935    0.72
53936    0.72
53937    0.70
53938    0.86
53939    0.75
Name: carat, Length: 53940, dtype: float64

In [None]:
#Task 5

# YOUR CODE HERE
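# numeric_only=True skips the categorical columns (cut, color, clarity);
# it is required in pandas 2.0 and later, where the default no longer drops them
x_corr, y_corr, z_corr = dia.corr(numeric_only=True).loc['carat', ['x', 'y', 'z']]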


print(x_corr, y_corr, z_corr)

0.9750942267264254 0.9517221990129883 0.9533873805614275



**Task 6** - Drop the correlated columns

Because these three columns are highly correlated with `carat` (all above 0.95), including them would add multicollinearity without much new information. Drop the three columns and reassign the result to `dia`.

In [None]:
#Task 6

# YOUR CODE HERE
dia = dia.drop(columns = ['x', 'y', 'z'])


**Task 7** - OLS Modeling

- Use `carat` as your independent feature.
- Use `price` as your dependent variable.
- Build an OLS model and review the summary report. Make sure to assign the fitted model to a variable called `model`.
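
For intuition about what the fit computes, a single-feature OLS model has a closed form: the slope is the covariance of the feature and the target divided by the variance of the feature, and the intercept makes the line pass through the two means. The sketch below checks this on synthetic data against `np.polyfit`. The helper `simple_ols` and the data are illustrative assumptions.

```python
import numpy as np


def simple_ols(x, y):
    """Closed-form intercept and slope for a one-feature least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    slope = (x_c * (y - y.mean())).sum() / (x_c ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


# Synthetic data with a known line: y = 3 + 2x + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.2, 2.0, size=500)
y = 3.0 + 2.0 * x + rng.normal(0, 0.1, size=500)

intercept, slope = simple_ols(x, y)
slope_np, intercept_np = np.polyfit(x, y, deg=1)  # highest degree first
print(intercept, slope)
```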

In [None]:
#Task 7

# YOUR CODE HERE
model = ols('price ~ carat', dia).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.849
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                 3.041e+05
Date:                Thu, 27 Apr 2023   Prob (F-statistic):               0.00
Time:                        03:20:36   Log-Likelihood:            -4.7273e+05
No. Observations:               53940   AIC:                         9.455e+05
Df Residuals:                   53938   BIC:                         9.455e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -2256.3606     13.055   -172.830      0.0

In [None]:
#Task 7 - Test




**Task 8** - Predictions and Residuals

- Create a new column that includes your model predictions for your features. Name the column `y_pred`
- Calculate the prediction residuals. Assign the values to a column named `residuals`.
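
A residual is the actual value minus the predicted value. When the model includes an intercept, OLS residuals average to zero and are uncorrelated with the feature, which makes a quick sanity check. The sketch below illustrates this on synthetic data, not the diamonds data.

```python
import numpy as np

# Synthetic data with a known linear relationship.
rng = np.random.default_rng(2)
x = rng.uniform(0.2, 2.0, size=300)
y = 5.0 * x - 1.0 + rng.normal(0, 0.5, size=300)

# Fit a line, predict, and take residuals as actual minus predicted.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = intercept + slope * x
residuals = y - y_pred

print(residuals.mean())      # numerically zero
print(np.dot(residuals, x))  # numerically zero
```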

In [None]:
#Task 8

# YOUR CODE HERE
dia['y_pred'] = model.predict()
dia['residuals'] = dia['price'] - dia['y_pred']

dia.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,y_pred,residuals
0,0.23,Ideal,E,SI2,61.5,55.0,326,-472.382688,798.382688
1,0.21,Premium,E,SI1,59.8,61.0,326,-627.5112,953.5112
2,0.23,Good,E,VS1,56.9,65.0,327,-472.382688,799.382688
3,0.29,Premium,I,VS2,62.4,58.0,334,-6.997151,340.997151
4,0.31,Good,J,SI2,63.3,58.0,335,148.131362,186.868638

In [None]:
#Task 8 - Test

assert dia.shape == (53940, 9), "Have you created the two columns?"


**Task 9** - Metrics

- Determine the values for the **mean absolute error, the mean squared error** and the **root mean squared error** for your previous model.
- Assign the values as `mae`, `mse`, and `rmse` respectively.
- *Hint*: We discussed a few methods for this in class. You can refer to this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) for other metric values.
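
As a reference for what the three metrics measure, the sketch below computes them directly with NumPy on a tiny hand-made example. The helper `regression_errors` and the `_demo` names are illustrative. The errors are 1, 0, and -2, so the MAE is 1 and the MSE is 5/3.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = np.abs(err).mean()   # average size of the errors
    mse = (err ** 2).mean()    # average squared error; penalizes large misses
    rmse = np.sqrt(mse)        # back in the units of the target
    return mae, mse, rmse


mae_demo, mse_demo, rmse_demo = regression_errors([3, 5, 8], [2, 5, 10])
print(mae_demo, mse_demo, rmse_demo)
```

Because RMSE is in the same units as `price`, it is usually the easiest of the three to interpret.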

In [None]:
#Task 9

# YOUR CODE HERE
mae = metrics.mean_absolute_error(dia['price'], dia['y_pred'])
mse = metrics.mean_squared_error(dia['price'], dia['y_pred'])
rmse = np.sqrt(mse)
print(mae, mse, rmse)

3932.0017821653687 31372487.13610113 5601.114811901389


**Task 10** - OLS Modeling, Additional Features

- Use `depth`, `table`, and `carat` as your independent features.
- Use `price` as your dependent variable.
- Build an OLS model and review the summary report. Make sure to assign the fitted model to a variable called `model`.
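
With several features, the fit estimates one coefficient per feature plus an intercept, which is why a three-feature model has four parameters. The sketch below shows the same idea with `np.linalg.lstsq` on synthetic data. The feature matrix and coefficients are made up for illustration.

```python
import numpy as np

# Synthetic data: three features with known coefficients and intercept 4.0.
rng = np.random.default_rng(3)
n = 400
features = rng.normal(size=(n, 3))
true_coefs = np.array([1.5, -2.0, 0.5])
target = 4.0 + features @ true_coefs + rng.normal(0, 0.1, size=n)

# A column of ones lets the solver estimate the intercept.
design = np.column_stack([np.ones(n), features])
params, *_ = np.linalg.lstsq(design, target, rcond=None)
print(params)  # intercept first, then one coefficient per feature
```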

In [None]:
#Task 10

# YOUR CODE HERE
model = ols('price ~ depth + table + carat', dia).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.854
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                 1.049e+05
Date:                Thu, 27 Apr 2023   Prob (F-statistic):               0.00
Time:                        03:37:41   Log-Likelihood:            -4.7194e+05
No. Observations:               53940   AIC:                         9.439e+05
Df Residuals:                   53936   BIC:                         9.439e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     1.3e+04    390.918     33.264      0.0
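
The summary reports the same value, 0.854, for R-squared and adjusted R-squared. Adjusted R-squared penalizes each added feature, but with 53,940 observations and only 3 features the penalty is negligible. The sketch below applies the standard formula to the numbers from the summary. The helper name `adjusted_r2` is an assumption.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Values read from the summary above: R-squared 0.854, 53940 rows, 3 features.
adj = adjusted_r2(0.854, 53940, 3)
print(round(adj, 3))
```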

In [None]:
#Task 10 - Test


assert len(model.params.index) == 4, "Make sure you've assigned both values."


**Task 11** - Predictions and Residuals

- Create a new column that includes your model predictions for your features. Name the column `y_pred`
- Calculate the residuals. Assign the values to a column named `residuals`.


In [None]:
#Task 11

# YOUR CODE HERE
dia['y_pred'] = model.predict()
dia['residuals'] = dia['price'] - dia['y_pred']
dia.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,y_pred,residuals
0,0.23,Ideal,E,SI2,61.5,55.0,326,-236.080501,562.080501
1,0.21,Premium,E,SI1,59.8,61.0,326,-762.990802,1088.990802
2,0.23,Good,E,VS1,56.9,65.0,327,-585.121107,912.121107
3,0.29,Premium,I,VS2,62.4,58.0,334,-214.085323,548.085323
4,0.31,Good,J,SI2,63.3,58.0,335,-193.022625,528.022625


In [None]:
#Task 11 - Test

assert dia.shape == (53940, 9), "Have you created the two columns?"


**Task 12** - Metrics

- Determine the values for the **mean absolute error, the mean squared error** and the **root mean squared error** for your previous model.
- Assign the values as `mae`, `mse`, and `rmse` respectively.
- *Hint*: We discussed a few methods for this in class. You can refer to this [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) for other metric values.
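
These metrics are computed on the same rows the model was fit on, so they say little about variance, that is, how well the fit carries over to new data. One common check is to hold out part of the data and compare the error on both parts. The sketch below does this on synthetic data with an 80/20 split. The split sizes and the `rmse` helper are assumptions for illustration.

```python
import numpy as np

# Synthetic data with a known linear relationship and noise of scale 0.5.
rng = np.random.default_rng(4)
x = rng.uniform(0.2, 2.0, size=1000)
y = 5.0 * x - 1.0 + rng.normal(0, 0.5, size=1000)

# Shuffle the row indices, then keep 80% for fitting and 20% for checking.
idx = rng.permutation(len(x))
train, test = idx[:800], idx[800:]

slope, intercept = np.polyfit(x[train], y[train], deg=1)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


rmse_train = rmse(y[train], intercept + slope * x[train])
rmse_test = rmse(y[test], intercept + slope * x[test])
print(rmse_train, rmse_test)
```

A test error far above the training error would suggest high variance. Similar but large errors on both would point toward bias.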

In [None]:
#Task 12

# YOUR CODE HERE
mae = metrics.mean_absolute_error(dia['price'], dia['y_pred'])
mse = metrics.mean_squared_error(dia['price'], dia['y_pred'])
rmse = np.sqrt(mse)
print(mae, mse, rmse)


In [None]:
#Task 12 - Test
