## Codio Activity 7.7: Using Non-Numeric Features

**Expected Time = 90 minutes**

**Total Points = 40**

This activity focuses on using categorical features.  In the tips dataset example, the `day` column was initially a string (object) column.  Dummy encoding such a feature produces a numeric representation that can be used in a regression model.

In this activity, you will explore the dummy encoding process to build and compare different regression models.  Specifically, you will use the sklearn estimators `LinearRegression` and `HuberRegressor` to fit your models.  These two estimators fit their parameters by minimizing the mean squared error and the Huber loss, respectively.
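To make the two loss functions concrete, here is a minimal sketch that evaluates both on a small hand-made example. It assumes the textbook form of the Huber loss with a hypothetical threshold `delta`; `HuberRegressor` itself uses a scaled variant controlled by its `epsilon` parameter, so this is an illustration and not the estimator's exact objective. The helper names `mse_loss` and `huber_loss` are made up for this example.

```python
import numpy as np


def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_res = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_res - 0.5 * delta)
    return float(np.mean(np.where(abs_res <= delta, quadratic, linear)))


# Toy values (not from the diamonds data); the last target is an outlier.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.0])

mse_value = mse_loss(y_true, y_pred)
huber_value = huber_loss(y_true, y_pred, delta=1.0)
print(mse_value)    # 2304.3125
print(huber_value)  # 24.03125
```

The single large residual dominates the mean squared error, while the Huber loss grows only linearly with it. This is why the two estimators can return quite different coefficients on data with extreme values.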

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)

In [1]:
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

### The Dataset

The `diamonds` dataset from seaborn is loaded and displayed below.  You will count the unique values in `cut` and `color`, build models on `cut` and `clarity` individually, and then build models using all available features.  To begin, you will use the pandas `get_dummies` function to produce the dummy encoded data.  Each dummy encoded feature should have as many columns as the original column has unique values.
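As a quick illustration of that last point, the sketch below dummy encodes a tiny hand-made column. The `toy` DataFrame is hypothetical and only borrows category names from the dataset.

```python
import pandas as pd

# Hypothetical four-row example with three distinct cut values.
toy = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal", "Fair"]})

# Double brackets keep a DataFrame, so the new columns get a "cut_" prefix.
toy_encoded = pd.get_dummies(toy[["cut"]])

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

For a plain string column the new columns come out in sorted order. For a pandas categorical column they follow the category order instead, which matters when you later match coefficients to columns by position.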

In [2]:
diamonds = sns.load_dataset('diamonds')

In [3]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [4]:
len(diamonds['cut'].unique().tolist())

5

In [5]:
len(diamonds['color'].unique().tolist())

7

[Back to top](#Index:) 

## Problem 1

### Unique Values in `cut` and `color`

**4 Points**

Using the `cut` and `color` columns, determine the number of unique values in each column.  Assign the number of unique values in each feature as integers to `num_cuts` and `num_color` below.  

In [6]:
### GRADED

num_cuts = ''
num_color = ''

### BEGIN SOLUTION
num_cuts = 5
num_color = 7
### END SOLUTION

# Answer check
print(num_cuts)
print(num_color)

5
7


In [7]:
### BEGIN HIDDEN TESTS
num_cuts_ = len(diamonds['cut'].unique().tolist())
num_color_ = len(diamonds['color'].unique().tolist())
#
#
#
assert type(num_cuts_) == type(num_cuts)
assert num_cuts == num_cuts_
assert num_color == num_color_
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 2

### Encoding the `cut` column

**4 Points**

Create a dummy encoded version of the `cut` column.  Assign your encoded data as a DataFrame to the variable `cut_encoded` below.  

In [8]:
### GRADED

cut_encoded = ''

### BEGIN SOLUTION
cut_encoded = pd.get_dummies(diamonds[['cut']])
### END SOLUTION

# Answer check
print(cut_encoded.shape)
print(type(cut_encoded))
cut_encoded.head()

(53940, 5)
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,cut_Fair
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,0,1,0
3,0,1,0,0,0
4,0,0,0,1,0


In [9]:
### BEGIN HIDDEN TESTS
cut_encoded_ = pd.get_dummies(diamonds[['cut']])
#
#
#
assert type(cut_encoded) == type(cut_encoded_)
pd.testing.assert_frame_equal(cut_encoded, cut_encoded_)
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 3

### A Regression model on `cut`

**4 Points**

Build a regression model using the dummy encoded version of the `cut` column to predict the `price` column.  Use the `LinearRegression` estimator and assign the model to `cut_linreg` below.  Be sure to set `fit_intercept = False`. 

In [10]:
### GRADED

X = ''
y = ''
cut_linreg = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds[['cut']])
y = diamonds['price']
cut_linreg = LinearRegression(fit_intercept=False).fit(X, y)
### END SOLUTION

# Answer check
print(cut_linreg)
print(type(cut_linreg))
cut_linreg.coef_

LinearRegression(fit_intercept=False)
<class 'sklearn.linear_model._base.LinearRegression'>


array([3457.54197021, 4584.2577043 , 3981.75989075, 3928.86445169,
       4358.75776398])

In [11]:
### BEGIN HIDDEN TESTS
X_ = pd.get_dummies(diamonds[['cut']])
y_ = diamonds['price']
cut_linreg_ = LinearRegression(fit_intercept=False).fit(X_, y_)
intercept_ = cut_linreg_.fit_intercept
coefs_ = cut_linreg_.coef_
coefs = cut_linreg.coef_
intercept = cut_linreg.fit_intercept
#
#
#
assert intercept == intercept_, 'Make sure to set fit_intercept = False'
np.testing.assert_array_almost_equal(coefs_, coefs, err_msg='Your coefficients are different.')
### END HIDDEN TESTS
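
One way to read the coefficients of such a model: with a full set of dummy columns and no intercept, least squares assigns each category its own mean target value. The sketch below checks this on a hypothetical five-row dataset, not on the diamonds data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: three cut levels with made-up prices.
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good", "Fair"],
    "price": [300.0, 500.0, 1000.0, 2000.0, 900.0],
})

X_toy = pd.get_dummies(toy[["cut"]]).astype(float)
y_toy = toy["price"]

toy_model = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)

# Label each coefficient with the dummy column it belongs to.
coef_by_column = pd.Series(toy_model.coef_, index=X_toy.columns)

# Mean price per category, in the same (sorted) order as the dummy columns.
group_means = toy.groupby("cut")["price"].mean()

print(coef_by_column)
print(group_means)
```

The dummy columns of one feature always sum to one, which duplicates what an intercept column would contribute. Turning the intercept off removes that redundancy and leaves each coefficient directly interpretable as a per-category prediction.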

[Back to top](#Index:) 

## Problem 4

### Interpreting the results

**4 Points**

Examine the coefficients of the model.  What price does your model predict for a diamond with an `Ideal` cut?  Assign your solution as a float rounded to two decimal places to `ideal_cut_prediction` below.  

In [12]:
### GRADED

ideal_cut_prediction = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds[['cut']])
y = diamonds['price']
cut_linreg = LinearRegression(fit_intercept=False).fit(X, y)
ideal_cut_prediction = float(round(cut_linreg.coef_[0], 2))
### END SOLUTION

# Answer check
print(ideal_cut_prediction)
print(type(ideal_cut_prediction))

3457.54
<class 'float'>


In [13]:
### BEGIN HIDDEN TESTS
X_ = pd.get_dummies(diamonds[['cut']])
y_ = diamonds['price']
cut_linreg_ = LinearRegression(fit_intercept=False).fit(X_, y_)
ideal_cut_prediction_ = float(round(cut_linreg_.coef_[0], 2))
#
#
#
assert len(str(ideal_cut_prediction)) == len(str(ideal_cut_prediction_)), 'Make sure to round your answer'
assert ideal_cut_prediction == ideal_cut_prediction_, 'There is a difference in your prediction after rounding.'
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 5

### Building a model on `clarity`

**4 Points**

Below, create a dummy encoded DataFrame of the `clarity` feature.  Assign this DataFrame to the variable `X` below, and the column `price` to `y`.  Use this encoded data to build a regression model to predict price.  Assign your fit model to the variable `clarity_linreg` below.  Be sure to set `fit_intercept = False`.  

In [14]:
### GRADED

X = ''
y = ''
clarity_linreg = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds[['clarity']])
y = diamonds['price']
clarity_linreg = LinearRegression(fit_intercept=False).fit(X, y)
### END SOLUTION

# Answer check
print(clarity_linreg.coef_)

[2864.83910615 2523.11463748 3283.73707067 3839.45539102 3924.98939468
 3996.00114811 5063.02860561 3924.16869096]


In [15]:
### BEGIN HIDDEN TESTS
X_ = pd.get_dummies(diamonds[['clarity']])
y_ = diamonds['price']
clarity_linreg_ = LinearRegression(fit_intercept=False).fit(X_, y_)
#
#
#
assert clarity_linreg.fit_intercept == clarity_linreg_.fit_intercept, 'Make sure to set fit_intercept = False'
np.testing.assert_array_almost_equal(clarity_linreg.coef_, clarity_linreg_.coef_, err_msg='Your coefficients are different.')
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 6

### Interpreting the results

**4 Points**

Examine your coefficients and compare these to the columns of the dummy encoded version of the `clarity` column.  What price does your model predict for a diamond with clarity `SI2`?  Assign your results as a float rounded to 2 decimal places to `clarity_si2_prediction`.

In [16]:
### GRADED

clarity_si2_prediction = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds[['clarity']])
y = diamonds['price']
clarity_linreg = LinearRegression(fit_intercept=False).fit(X, y)
clarity_si2_prediction = float(round(clarity_linreg.coef_[-2], 2))
### END SOLUTION

# Answer check
print(clarity_si2_prediction)
print(type(clarity_si2_prediction))

5063.03
<class 'float'>


In [17]:
print(clarity_linreg.coef_)

[2864.83910615 2523.11463748 3283.73707067 3839.45539102 3924.98939468
 3996.00114811 5063.02860561 3924.16869096]


In [18]:
### BEGIN HIDDEN TESTS
# Answer computation
X_ = pd.get_dummies(diamonds[['clarity']])
y_ = diamonds['price']
clarity_linreg1_ = LinearRegression(fit_intercept=False).fit(X_, y_)
clarity_si2_prediction_ = float(round(clarity_linreg1_.coef_[-2], 2))
#
#
#
# 
assert clarity_si2_prediction == clarity_si2_prediction_
### END HIDDEN TESTS
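
Picking a coefficient by position only works if you know the column order of the encoded data. A less fragile alternative, sketched below on hypothetical toy data, is to label the coefficients with the column names and look the value up by name. The three-level `clarity_order` is an assumption made for this example, not the dataset's full set of clarity grades.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical three-level subset with an explicit category order.
clarity_order = ["IF", "SI2", "I1"]
toy = pd.DataFrame({
    "clarity": pd.Categorical(["SI2", "IF", "I1", "SI2"], categories=clarity_order),
    "price": [5000.0, 2800.0, 3900.0, 5200.0],
})

# Dummy columns of a categorical column follow the category order.
X_toy = pd.get_dummies(toy[["clarity"]]).astype(float)
model = LinearRegression(fit_intercept=False).fit(X_toy, toy["price"])

# Attach column names so a coefficient can be selected by label.
coef_by_name = pd.Series(model.coef_, index=X_toy.columns)
si2_prediction = float(round(coef_by_name["clarity_SI2"], 2))

print(X_toy.columns.tolist())
print(si2_prediction)
```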

[Back to top](#Index:) 

## Problem 7

### A Model with `cut`, `clarity`, and `carat`

**4 Points**

Now, you are to build a model with three features -- `cut`, `clarity`, `carat`.  Create the dummy encoded data and use `LinearRegression` to build a model to predict `price`.  Assign your fit model to the variable `ccc_linreg` below.  Be sure to set `fit_intercept = False`.  

In [19]:
### GRADED

ccc_linreg = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds[['carat', 'cut', 'clarity']])
y = diamonds['price']
ccc_linreg = LinearRegression(fit_intercept=False).fit(X, y)
### END SOLUTION

# Answer check
print(ccc_linreg)

LinearRegression(fit_intercept=False)


In [20]:
### BEGIN HIDDEN TESTS
# Answer computation
X_ = pd.get_dummies(diamonds[['carat', 'cut', 'clarity']])
y_ = diamonds['price']
ccc_linreg_ = LinearRegression(fit_intercept=False).fit(X_, y_)
coefs_ = ccc_linreg_.coef_
coefs = ccc_linreg.coef_
#
#
#
# 
np.testing.assert_array_equal(coefs, coefs_)
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 8

### Interpreting the results

**4 Points**

Examine the coefficients from the model and use them to determine the predicted price of a diamond with the following features:

```
carat = 0.8
cut = Ideal
clarity = SI2
```

Assign your solution as a float rounded to two decimal places to the variable `ccc_prediction` below.  

In [21]:
### GRADED

ccc_prediction = ''

### BEGIN SOLUTION
diamonds_encoded = pd.get_dummies(diamonds[['carat', 'cut', 'clarity']])

diamond_features = pd.DataFrame({
    'carat': [0.8],
    'cut': ['Ideal'],
    'clarity': ['SI2']
})

diamond_features_encoded = pd.get_dummies(diamond_features).reindex(columns=diamonds_encoded.columns, fill_value=0)

ccc_linreg1 = LinearRegression(fit_intercept=False).fit(diamonds_encoded, diamonds['price'])

ccc_prediction = ccc_linreg1.predict(diamond_features_encoded)

ccc_prediction = float(round(ccc_prediction[0], 2))

### END SOLUTION

# Answer check
print(ccc_prediction)
print(type(ccc_prediction))

2880.0
<class 'float'>


In [22]:
### BEGIN HIDDEN TESTS
# Answer computation
diamonds_encoded_ = pd.get_dummies(diamonds[['carat', 'cut', 'clarity']])
diamond_features_ = pd.DataFrame({
    'carat': [0.8],
    'cut': ['Ideal'],
    'clarity': ['SI2']
})
diamond_features_encoded_ = pd.get_dummies(diamond_features_).reindex(columns=diamonds_encoded_.columns, fill_value=0)

ccc_linreg1_ = LinearRegression(fit_intercept=False).fit(diamonds_encoded_, diamonds['price'])
ccc_prediction1_ = ccc_linreg1_.predict(diamond_features_encoded_)
ccc_prediction1_ = round(ccc_prediction1_[0], 2)


coefs_ = ccc_linreg_.coef_
coefs = ccc_linreg.coef_

# Check if the coefficients match between the two models
assert np.array_equal(coefs, coefs_), "Coefficients do not match"
assert np.isclose(ccc_prediction1_, ccc_prediction), "Prediction values do not match"

print("Test case passed!")

### END HIDDEN TESTS

Test case passed!
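
The prediction for a single diamond can also be assembled by hand from the coefficients: the carat value times the `carat` coefficient, plus the coefficients of the one active dummy column per categorical feature. The sketch below uses hypothetical toy data to check that this sum matches `predict`, and shows the `reindex` step that aligns a single encoded row with the training columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with two cut levels and two clarity levels.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9, 1.1, 1.3],
    "cut": ["Ideal", "Good", "Ideal", "Good", "Ideal", "Good"],
    "clarity": ["SI2", "SI2", "VS1", "VS1", "SI2", "VS1"],
    "price": [400.0, 900.0, 2500.0, 3600.0, 5200.0, 7000.0],
})

X_toy = pd.get_dummies(toy[["carat", "cut", "clarity"]]).astype(float)
model = LinearRegression(fit_intercept=False).fit(X_toy, toy["price"])
coefs = pd.Series(model.coef_, index=X_toy.columns)

# Encode one new row, then add the dummy columns it is missing as zeros.
new_row = pd.DataFrame({"carat": [0.8], "cut": ["Ideal"], "clarity": ["SI2"]})
new_encoded = (
    pd.get_dummies(new_row)
    .reindex(columns=X_toy.columns, fill_value=0)
    .astype(float)
)

predicted = float(model.predict(new_encoded)[0])
manual = 0.8 * coefs["carat"] + coefs["cut_Ideal"] + coefs["clarity_SI2"]

print(predicted, manual)
```

With two fully dummy encoded features and no intercept, the individual `cut` and `clarity` coefficients are not uniquely determined, because a constant can be shifted from one group of dummies to the other. The sum for any single diamond, and therefore the prediction, is unaffected by that shift.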


[Back to top](#Index:) 

## Problem 9

### A Model with all features

**4 Points**

Now, build a model that contains all features to predict `price`.  Be sure to dummy encode all of the features in the data.  Determine the `mean_squared_error` of your predictions.  Use the `LinearRegression` estimator and the `mean_squared_error` function from sklearn.  Be sure to set `fit_intercept = False`.  

Assign your fit model to the variable `all_features_linreg` and the mean squared error as a float to `linreg_mse` below.

In [23]:
### GRADED

X = ''
y = ''
all_features_linreg = ''
linreg_mse = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds.drop('price', axis = 1))
y = diamonds['price']
all_features_linreg = LinearRegression(fit_intercept=False).fit(X, y)
linreg_mse = mean_squared_error(y, all_features_linreg.predict(X))
### END SOLUTION

# Answer check
print(all_features_linreg)
print(all_features_linreg.coef_)
print(linreg_mse)


LinearRegression(fit_intercept=False)
[ 1.12569783e+04 -6.38061004e+01 -2.64740847e+01 -1.00826110e+03
  9.60888648e+00 -5.01188909e+01  2.71221727e+03  2.64144937e+03
  2.60608801e+03  2.45905687e+03  1.87930542e+03  2.58257671e+03
  2.37345863e+03  2.30972288e+03  2.10053781e+03  1.60231004e+03
  1.11633224e+03  2.13178649e+02  3.06769746e+03  2.73035426e+03
  2.67340929e+03  2.30099313e+03  1.98981878e+03  1.38806730e+03
  4.25181510e+02 -2.27740478e+03]
1276545.174308389


In [24]:
diamonds.drop('price', axis = 1)

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,6.15,6.12,3.74


In [25]:
### BEGIN HIDDEN TESTS
# Answer computation
X_ = pd.get_dummies(diamonds.drop('price', axis = 1))
y_ = diamonds['price']
all_features_linreg_ = LinearRegression(fit_intercept=False).fit(X_, y_)
linreg_mse_ = mean_squared_error(y_, all_features_linreg_.predict(X_))
coefs_ = all_features_linreg_.coef_
coefs = all_features_linreg.coef_
#
#
#
# Compare variables using assert
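np.testing.assert_array_almost_equal(coefs, coefs_, err_msg='Your coefficients are different.')
assert np.isclose(linreg_mse, linreg_mse_), 'Your mean squared error is different.'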
### END HIDDEN TESTS

[Back to top](#Index:) 

## Problem 10

### A `HuberRegressor` on all features

**4 Points**

Using all the features as in the previous problem, build a model using the `HuberRegressor` estimator from `sklearn`.  Be sure to set `fit_intercept = False` and assign your fit model to `huber_all_features` below.  Compute the mean squared error of the Huber model and assign it as a float to the variable `huber_mse` below.

In [36]:
### GRADED

X = ''
y = ''
huber_all_features = ''
huber_mse = ''

### BEGIN SOLUTION
X = pd.get_dummies(diamonds.drop('price', axis = 1))
y = diamonds['price']
huber_all_features = HuberRegressor(fit_intercept=False).fit(X, y)
huber_mse = mean_squared_error(y, huber_all_features.predict(X))
### END SOLUTION

# Answer check
print(huber_all_features)
print(huber_all_features.coef_)
print(huber_mse)

HuberRegressor(fit_intercept=False)
[ 8.36864549e+03  4.77371395e+00 -5.00383795e+01 -1.25436182e+02
 -3.20525008e+02  5.68317878e+02 -5.36145917e+01  1.04889337e+02
  2.87007831e+02  1.03901240e+02 -7.26953341e+02  8.36144257e+02
  5.29405464e+02  3.66665899e+02  4.94988592e+02  1.97529610e+01
 -5.13222042e+02 -2.01850465e+03  1.84632242e+03  1.04080121e+03
  5.04524144e+02  1.76177713e+02  2.77300199e+02 -5.00095751e+02
 -1.19892424e+03 -2.43087523e+03]


KeyboardInterrupt: 

In [27]:
### BEGIN HIDDEN TESTS
# Answer computation
X_ = pd.get_dummies(diamonds.drop('price', axis = 1))
y_ = diamonds['price']
huber_all_features_ = HuberRegressor(fit_intercept=False).fit(X_, y_)
huber_mse_ = mean_squared_error(y_, huber_all_features_.predict(X_))
coefs_ = huber_all_features_.coef_
coefs = huber_all_features.coef_
#
#
#
# Compare variables using assert
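np.testing.assert_array_almost_equal(coefs, coefs_, err_msg='Your coefficients are different.')
assert np.isclose(huber_mse, huber_mse_), 'Your mean squared error is different.'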
### END HIDDEN TESTS
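
When comparing the two estimators, it helps to see how differently they react to extreme values. The sketch below is a hypothetical one-feature example with synthetic data, in which a true slope of 3 is assumed and a few large outliers are injected. It fits both models without an intercept and compares the recovered slopes and the training mean squared errors. `max_iter=1000` is a cautious choice for this example, not a recommendation from the activity.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 3x plus small noise, with five large outliers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + rng.normal(scale=0.5, size=x.size)
y[-5:] += 200.0

X = x.reshape(-1, 1)

ols = LinearRegression(fit_intercept=False).fit(X, y)
huber = HuberRegressor(fit_intercept=False, max_iter=1000).fit(X, y)

# Distance of each fitted slope from the true slope of 3.
ols_slope_error = abs(ols.coef_[0] - 3.0)
huber_slope_error = abs(huber.coef_[0] - 3.0)

# Training MSE of each model (true values first, predictions second).
ols_mse = mean_squared_error(y, ols.predict(X))
huber_mse = mean_squared_error(y, huber.predict(X))

print(ols_slope_error, huber_slope_error)
print(ols_mse, huber_mse)
```

Because `LinearRegression` minimizes the training mean squared error directly, a Huber fit of the same form cannot score lower on that metric. A higher training MSE for the Huber model is therefore expected and does not by itself mean the model is worse.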

### Conclusion

While some basic initial models have been explored here, there is much more to explore in tuning them.  One option is to revisit how the features are represented, through transformations and by engineering new representations of existing features.  For example, the dimensions of the diamond in `x`, `y`, and `z` could be multiplied to create a "volume" feature, which summarizes three columns of data in one.  A second approach is to use PCA to reduce the dimensionality of the data.  A third is to use clustering to engineer new features based on the cluster results.  Consider exploring different representations of the features and trying to improve these initial models.
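
As a starting point for the volume idea, here is a minimal sketch. The three rows reuse the `x`, `y`, and `z` values shown in the `diamonds.head()` output above, and the column name `volume` is a hypothetical choice.

```python
import pandas as pd

# First three rows of x, y, z as displayed in the head() output.
toy = pd.DataFrame({
    "x": [3.95, 3.89, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 2.31],
})

# Replace the three dimension columns with their product.
toy_with_volume = (
    toy.assign(volume=toy["x"] * toy["y"] * toy["z"])
    .drop(columns=["x", "y", "z"])
)

print(toy_with_volume)
```

The product is the volume of a bounding box around the stone, not the true volume of a cut diamond, so treat it as a rough size proxy when comparing models.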