<a href="https://colab.research.google.com/github/Tclack88/Lambda/blob/master/precourse/LSDS_Intro_Assignment_7_More_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.
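
As a reminder of what the metric measures, mean absolute error is the average of the absolute differences between actual and predicted prices. The sketch below checks a by-hand calculation against `mean_absolute_error`; the three price pairs are illustrative values chosen for this example, not results from the lesson.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative values only (assumed for this sketch)
y_true = np.array([432, 687, 1124])   # actual prices
y_pred = np.array([422, 489, 737])    # predicted prices

# MAE by hand: mean of |actual - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same quantity from scikit-learn (signature is y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because the metric uses absolute values, swapping the two arguments gives the same number, but passing the actual values first matches the documented signature.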

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])


test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [0]:
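# Create model
features = ['carat', 'cut']
target = 'price'
model = KNeighborsRegressor(n_neighbors=1)
model.fit(train[features], train[target])

# Evaluate on the training data and on the test data, comparing each set of
# predictions with the prices from the same data set
train_mae = mean_absolute_error(train[target], model.predict(train[features]))
test_mae = mean_absolute_error(test[target], model.predict(test[features]))
print('train mean absolute error (1 nearest neighbor):', train_mae, "buckaroonees")
print('test mean absolute error (1 nearest neighbor):', test_mae, "buckaroonees")

train mean absolute error (1 nearest neighbor): 0.0 buckaroonees
test mean absolute error (1 nearest neighbor): 1128.8 buckaroonees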


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 
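
One way to explore these questions is to compute train and test error side by side for both configurations. The sketch below rebuilds the same 10 training and 10 test rows so it runs on its own; `train_small`, `test_small`, and `knn_errors` are names made up for this example.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
columns = ['carat', 'cut', 'price']

train_small = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 422], [0.31, 'Ideal', 489], [0.42, 'Premium', 737],
    [0.5, 'Ideal', 1415], [0.51, 'Premium', 1177], [0.7, 'Fair', 1865],
    [0.73, 'Fair', 2351], [1.01, 'Good', 3768], [1.18, 'Very Good', 3965],
    [1.18, 'Ideal', 4838]])

test_small = pd.DataFrame(columns=columns, data=[
    [0.3, 'Ideal', 432], [0.34, 'Ideal', 687], [0.37, 'Premium', 1124],
    [0.4, 'Good', 720], [0.51, 'Ideal', 1397], [0.51, 'Very Good', 1284],
    [0.59, 'Ideal', 1437], [0.7, 'Ideal', 3419], [0.9, 'Premium', 3484],
    [0.9, 'Fair', 2964]])

# Encode the ordinal cut labels as integers in both frames
for frame in (train_small, test_small):
    frame['cut'] = frame['cut'].map(cut_ranks)


def knn_errors(n_neighbors, features):
    """Fit a KNN regressor and return (train MAE, test MAE)."""
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(train_small[features], train_small['price'])
    train_mae = mean_absolute_error(
        train_small['price'], model.predict(train_small[features]))
    test_mae = mean_absolute_error(
        test_small['price'], model.predict(test_small[features]))
    return train_mae, test_mae


# This assignment's model, then the lesson's earlier configuration
for k, feats in [(1, ['carat', 'cut']), (2, ['carat'])]:
    train_mae, test_mae = knn_errors(k, feats)
    print(f"n_neighbors={k}, features={feats}: "
          f"train MAE={train_mae}, test MAE={test_mae}")
```

With `n_neighbors=1` and no duplicate rows in the training features, every training point is its own nearest neighbor, so the training error is zero by construction. A training error far below the test error is the usual sign of overfitting.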



### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [0]:
import seaborn as sb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sb.load_dataset('diamonds')
df = df[df.price < 5000]
df.cut = df.cut.map(cut_ranks)
train, test = train_test_split(df.copy(), random_state=0)
train.shape, test.shape

((29409, 10), (9804, 10))

Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [0]:
model = LinearRegression()
model.fit(train[features],train[target])
linear_regression_predictions = model.predict(test[features])
mae_lin_reg = mean_absolute_error(linear_regression_predictions,test[target])
print("mean absolute error: $",round(mae_lin_reg,2))

mean absolute error: $ 309.52


Use this model to predict the price of a half-carat diamond with a "Very Good" cut.

In [0]:

# 'Very Good' is rank 3 in cut_ranks; rank 2 ('Good') is what yields $1415.67 from this model
print("predicted price for .5 carat diamond with \"very good\" cut: $",round(float(model.predict([[.5,cut_ranks['Very Good']]])[0]),2))


### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, which you'd need to encode as numbers if you want to use them in your model:

In [0]:
train.describe(include=['object'])

        color clarity
count   29409   29409
unique      7       8
top         E     SI1
freq     6090    6948


In [0]:
clarity_rank = {"IF":0,"VVS1":1, "VVS2":2,"VS1":3, "VS2":4,"SI1":5, "SI2":6, "I1":7}
train.clarity = train.clarity.map(clarity_rank)
test.clarity = test.clarity.map(clarity_rank)

color_rank = {"J":7, "I":6, "H":5, "G":4, "F":3, "E":2, "D":1 }
train.color = train.color.map(color_rank)
test.color = test.color.map(color_rank)

# The size of a diamond matters, so add a volume feature that combines x, y, and z
test['volume'] = test.x * test.y * test.z
train['volume'] = train.x * train.y * train.z

print(test.head())

       carat  cut  color  clarity  depth  ...  price     x     y     z      volume
9742    1.20    5      7        6   61.7  ...   4659  6.79  6.87  4.21  196.385133
9374    0.32    5      6        1   62.5  ...    589  4.37  4.40  2.74   52.684720
10683   1.01    5      3        6   62.7  ...   4843  6.36  6.39  4.00  162.561600
4589    1.01    2      7        4   62.8  ...   3655  6.30  6.35  3.97  158.819850
2196    0.90    2      2        6   63.4  ...   3139  6.00  6.02  3.81  137.617200

[5 rows x 11 columns]
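
A caveat with dictionary-based encoding: `Series.map` returns `NaN` for any value that is not a key in the dictionary, which also happens if the mapping cell is run a second time on a column that is already numeric. The helper below is a minimal sketch of a stricter encoder; `encode_ordinal` is a hypothetical name, and the four-row sample is made up for illustration.

```python
import pandas as pd

clarity_rank = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3,
                "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}


def encode_ordinal(series, ranks):
    """Map labels to integer ranks, raising if any label has no rank."""
    encoded = series.astype(object).map(ranks)
    # Values that failed to map come back as NaN
    missing = sorted(set(series[encoded.isna()].astype(str)))
    if missing:
        raise ValueError(f"labels without a rank: {missing}")
    return encoded.astype(int)


# Made-up sample of clarity grades
sample = pd.Series(["SI1", "VS2", "IF", "I1"], name="clarity")
encoded_sample = encode_ordinal(sample, clarity_rank)
print(encoded_sample.tolist())
```

Raising an error here is a design choice for this sketch: silent `NaN` values would otherwise surface later as a failure inside `model.fit`.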


In [0]:

# Fit on the three encoded ordinal features plus the engineered volume term
features = ['cut', 'color', 'clarity', 'volume']
target = 'price'
model = LinearRegression()
model.fit(train[features], train[target])
predictions_extended = model.predict(test[features])
mae_more_features = mean_absolute_error(test[target], predictions_extended)

print("mean absolute error:", round(mae_more_features, 2))
print("\nfeature coefficients:")
# Pair each feature name with its fitted coefficient
for name, coef in zip(features, model.coef_):
  print(name, round(coef, 2))
print("\nintercept:")
print(round(model.intercept_, 2))

mean absolute error: 248.59

feature coefficients:
cut 24.55
color -103.79
clarity -144.07
volume 33.07

intercept:
-363.78
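
To see how these numbers turn into a price, a linear regression prediction is the intercept plus each coefficient multiplied by its feature value. The sketch below applies the rounded coefficients printed above to test row 9374 from the earlier `test.head()` output (cut 5, color 6, clarity 1, volume 52.684720, actual price 589). Because the coefficients are rounded to two decimals, the result only approximates what `model.predict` would return; the variable names are chosen for this example.

```python
# Rounded values copied from the printed output above
coefficients = {'cut': 24.55, 'color': -103.79, 'clarity': -144.07, 'volume': 33.07}
intercept = -363.78

# Test row 9374 as shown in the test.head() output
row_9374 = {'cut': 5, 'color': 6, 'clarity': 1, 'volume': 52.684720}
actual_price = 589

# prediction = intercept + sum(coefficient * feature value)
predicted = intercept + sum(
    coefficients[name] * row_9374[name] for name in coefficients)

print("predicted:", round(predicted, 2))
print("absolute error:", round(abs(predicted - actual_price), 2))
```

The signs are consistent with the encodings used here: larger `color` and `clarity` ranks mean lower grades, so their coefficients are negative, while a larger volume raises the predicted price.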
