<a href="https://colab.research.google.com/github/Avery1493/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/Quinn_213_LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

# Imports

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

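# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character, not the regex end-of-string anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)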
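
A small illustration of the cleaning step above, run on made-up price strings (the values are hypothetical, not rows from the dataset). Passing `regex=False` asks pandas to treat `$` as a literal character rather than as a regular-expression anchor:

```python
import pandas as pd

# Hypothetical raw values in the style of the SALE_PRICE column
raw = pd.Series(['$697,000', '$-0', '$1,200,000'])

clean = (
    raw
    .str.replace('$', '', regex=False)   # literal dollar sign
    .str.replace('-', '', regex=False)   # literal dash
    .str.replace(',', '', regex=False)   # thousands separators
    .astype(int)
)

print(clean.tolist())
```

The resulting column has an integer dtype, so it can be compared against numeric thresholds later on.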

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
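
The same cardinality reduction can be sketched on a toy column. The labels below are hypothetical, and `Series.where` is used here as an alternative to the `.loc` assignment in the cell above:

```python
import pandas as pd

# Hypothetical neighborhood labels: 'A' and 'B' are the most frequent
hoods = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])

top2 = hoods.value_counts().index[:2]                    # two most frequent labels
hoods_reduced = hoods.where(hoods.isin(top2), 'OTHER')   # keep top labels, replace the rest

print(hoods_reduced.tolist())
```

Grouping rare labels this way keeps the number of one-hot columns small once the feature is encoded.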

In [0]:
print(df.shape)
df.sample(5)

(23040, 21)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
3577,5,OTHER,01 ONE FAMILY DWELLINGS,1,4419,60,,A2,196 AMBER STREET,,10306.0,1.0,0.0,1.0,4300,1104.0,1955.0,1,A2,697000,01/18/2019
7574,3,OTHER,01 ONE FAMILY DWELLINGS,1,1203,38,,S1,68 KINGSTON AVENUE,,11213.0,1.0,1.0,2.0,1000,1840.0,1940.0,1,S1,0,02/07/2019
4694,4,OTHER,02 TWO FAMILY DWELLINGS,1,11986,37,,B2,142-11 116TH AVENUE,,11436.0,2.0,0.0,2.0,4000,2250.0,1955.0,1,B2,0,01/24/2019
11249,4,OTHER,01 ONE FAMILY DWELLINGS,1,3160,28,,A5,6714 GROTON STREET,,11375.0,1.0,0.0,1.0,2200,1084.0,1940.0,1,A5,940000,02/26/2019
18135,5,OTHER,01 ONE FAMILY DWELLINGS,1,3420,7,,A5,116 WINFIELD STREET,,10305.0,1.0,0.0,1.0,2185,1408.0,1997.0,1,A5,0,03/29/2019


#  Use a subset of the data where BUILDING_CLASS_CATEGORY == '01 ONE FAMILY DWELLINGS' and the sale price was more than 100 thousand and less than 2 million.

In [0]:
# Subset: building class '01 ONE FAMILY DWELLINGS' with sale price between 100 thousand and 2 million (bounds inclusive)
single_fam = df[(df['SALE_PRICE'] >= 100000) &
   (df['SALE_PRICE'] <= 2000000) &
   (df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')]

print(single_fam.shape)
single_fam.sample(3)

(3164, 21)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
11101,2,OTHER,01 ONE FAMILY DWELLINGS,1,3986,48,,A5,2515 GLEBE AVENUE,,10461.0,1.0,0.0,1.0,1800,1690.0,1955.0,1,A5,485000,02/26/2019
12399,2,OTHER,01 ONE FAMILY DWELLINGS,1,4078,29,,A1,2532 POPLAR STREET,,10461.0,1.0,0.0,1.0,2908,1300.0,1925.0,1,A1,285000,03/04/2019
6229,3,OTHER,01 ONE FAMILY DWELLINGS,1,7637,64,,A9,1270 EAST 38 STREET,,11210.0,1.0,0.0,1.0,2067,1176.0,1930.0,1,A9,522500,01/31/2019


#  Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.

In [0]:
single_fam.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [0]:
# Convert SALE_DATE to datetime, then split at the cutoff:
# sales before April 2019 go to train, sales from April onward go to test
single_fam = single_fam.copy()  # explicit copy avoids SettingWithCopyWarning on the assignment below
single_fam['SALE_DATE'] = pd.to_datetime(single_fam['SALE_DATE'], format='%m/%d/%Y')
cutoff = pd.to_datetime('2019-04-01')
train = single_fam[single_fam['SALE_DATE'] < cutoff]
test  = single_fam[single_fam['SALE_DATE'] >= cutoff]


In [0]:
train.sample(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
8471,3,OTHER,01 ONE FAMILY DWELLINGS,1,6543,20,,A9,1236 EAST 8TH ST.,,11230.0,1.0,0.0,1.0,2410,1583.0,1915.0,1,A9,1200000,2019-02-12
4939,4,OTHER,01 ONE FAMILY DWELLINGS,1,6599,28,,A5,136-34 71ST ROAD,,11367.0,1.0,0.0,1.0,1800,1660.0,1940.0,1,A5,785000,2019-01-25
13338,5,OTHER,01 ONE FAMILY DWELLINGS,1,3348,11,,A1,38 LACONIA AVENUE,,10305.0,1.0,0.0,1.0,3160,1970.0,1930.0,1,A1,683000,2019-03-07
16465,4,OTHER,01 ONE FAMILY DWELLINGS,1,11615,58,,A1,10977 135TH STREET,,11420.0,1.0,0.0,1.0,2507,1258.0,1955.0,1,A1,540000,2019-03-22
7686,4,OTHER,01 ONE FAMILY DWELLINGS,1,13569,9,,A1,243-45 145TH AVE,,11422.0,1.0,0.0,1.0,3000,1624.0,1940.0,1,A1,130000,2019-02-07
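
The time-based split can be checked on a handful of made-up rows. This is a minimal sketch; the dates and prices are hypothetical and only mimic the `MM/DD/YYYY` format seen in `SALE_DATE`:

```python
import pandas as pd

# Hypothetical sales in the same MM/DD/YYYY format as SALE_DATE
sales = pd.DataFrame({
    'SALE_DATE': ['01/18/2019', '03/30/2019', '04/01/2019', '04/15/2019'],
    'SALE_PRICE': [697000, 540000, 485000, 785000],
})
sales['SALE_DATE'] = pd.to_datetime(sales['SALE_DATE'], format='%m/%d/%Y')

cutoff = pd.to_datetime('2019-04-01')
train_demo = sales[sales['SALE_DATE'] < cutoff]    # January through March
test_demo = sales[sales['SALE_DATE'] >= cutoff]    # April onward

print(len(train_demo), len(test_demo))
```

Splitting on a date rather than at random means the model is evaluated on sales that happened after everything it was trained on.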


#  Do one-hot encoding of categorical features.

In [0]:
#Looking at categorical features
train.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq,first,last
BUILDING_CLASS_CATEGORY,2517,1,01 ONE FAMILY DWELLINGS,2517,NaT,NaT
APARTMENT_NUMBER,1,1,RP.,1,NaT,NaT
TAX_CLASS_AT_PRESENT,2517,2,1,2486,NaT,NaT
BOROUGH,2517,5,4,1209,NaT,NaT
NEIGHBORHOOD,2517,7,OTHER,2368,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,2517,11,A1,921,NaT,NaT
BUILDING_CLASS_AT_PRESENT,2517,13,A1,921,NaT,NaT
SALE_DATE,2517,68,2019-01-31 00:00:00,78,2019-01-01,2019-03-30
LAND_SQUARE_FEET,2517,889,4000,235,NaT,NaT
ADDRESS,2517,2507,33 BAILEY PLACE,2,NaT,NaT


In [0]:
# Choose the target and the features: drop high-cardinality columns and the NaN-filled EASE-MENT column
target = 'SALE_PRICE'
high_cardinality = ['ADDRESS', 'LAND_SQUARE_FEET',
                    'SALE_DATE']
NANS = ['EASE-MENT']
features = train.columns.drop([target] + high_cardinality + NANS )

In [0]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [0]:
#Import encoder
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)

In [0]:
#After encoding
print(X_train.shape)
X_train.sample(5) 

(2517, 50)


Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,APARTMENT_NUMBER_nan,APARTMENT_NUMBER_RP.,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
4818,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,5455,60,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,10465.0,0.0,0.0,0.0,0.0,1921.0,1,0,0,0,0,0,0,0,0,0,1,0
4490,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,3253,138,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,10463.0,1.0,0.0,1.0,1578.0,1920.0,1,0,0,1,0,0,0,0,0,0,0,0
3877,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,6975,33,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,11366.0,1.0,0.0,1.0,1329.0,1960.0,1,0,0,0,0,1,0,0,0,0,0,0
8581,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,10990,48,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,11412.0,1.0,0.0,1.0,1256.0,1930.0,1,0,1,0,0,0,0,0,0,0,0,0
6655,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,9898,107,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,11423.0,1.0,0.0,1.0,1232.0,1925.0,1,0,1,0,0,0,0,0,0,0,0,0


In [0]:
#Test data after encoding
X_test =  encoder.transform(X_test)
print(X_test.shape)
X_test.sample(5) 

(647, 50)


Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,APARTMENT_NUMBER_nan,APARTMENT_NUMBER_RP.,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
19121,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,11695,42,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,11420.0,1.0,0.0,1.0,448.0,1935.0,1,0,0,0,0,1,0,0,0,0,0,0
21868,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0,5666,54,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,11365.0,1.0,0.0,1.0,1633.0,1955.0,1,0,0,1,0,0,0,0,0,0,0,0
20015,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,3086,13,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,10457.0,1.0,0.0,1.0,2184.0,1905.0,1,0,1,0,0,0,0,0,0,0,0,0
21394,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,7895,24,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11234.0,1.0,0.0,1.0,1041.0,1920.0,1,1,0,0,0,0,0,0,0,0,0,0
22114,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,4760,52,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,11203.0,1.0,0.0,1.0,1832.0,1930.0,1,0,0,1,0,0,0,0,0,0,0,0
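
The encoder is fit on the training data only and then reused on the test data, which is why both frames end up with the same 50 columns. The sketch below illustrates that idea with `pandas.get_dummies` plus `reindex` instead of `category_encoders`; the toy frames are hypothetical:

```python
import pandas as pd

train_demo = pd.DataFrame({'BOROUGH': ['3', '4', '2']})
test_demo = pd.DataFrame({'BOROUGH': ['4', '1']})   # '1' never appears in train

train_enc = pd.get_dummies(train_demo, columns=['BOROUGH'], dtype=int)

# Reuse the training columns: categories unseen in train are dropped,
# and training categories missing from test are filled with 0
test_enc = (
    pd.get_dummies(test_demo, columns=['BOROUGH'], dtype=int)
    .reindex(columns=train_enc.columns, fill_value=0)
)

print(list(train_enc.columns))
print(test_enc.values.tolist())
```

A test row whose category was never seen in training becomes a row of zeros, so the model never receives a column it was not trained on.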


#  Do feature selection with SelectKBest.

In [0]:
#Import
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func= f_regression, k = 26)

In [0]:
#fit and transform on train
X_train_selected = selector.fit_transform(X_train,y_train)
X_train_selected.shape

  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


(2517, 26)

In [0]:
#Showing selected features
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]
selected_names

Index(['BOROUGH_3', 'BOROUGH_2', 'BOROUGH_5', 'NEIGHBORHOOD_OTHER',
       'NEIGHBORHOOD_FLUSHING-NORTH', 'NEIGHBORHOOD_FOREST HILLS',
       'NEIGHBORHOOD_BOROUGH PARK', 'TAX_CLASS_AT_PRESENT_1', 'BLOCK',
       'BUILDING_CLASS_AT_PRESENT_A5', 'BUILDING_CLASS_AT_PRESENT_A3',
       'BUILDING_CLASS_AT_PRESENT_S1', 'BUILDING_CLASS_AT_PRESENT_A4',
       'BUILDING_CLASS_AT_PRESENT_A6', 'BUILDING_CLASS_AT_PRESENT_S0',
       'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'GROSS_SQUARE_FEET', 'BUILDING_CLASS_AT_TIME_OF_SALE_A5',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A3',
       'BUILDING_CLASS_AT_TIME_OF_SALE_S1',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A4',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A6',
       'BUILDING_CLASS_AT_TIME_OF_SALE_S0'],
      dtype='object')
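
For intuition about what `f_regression` scores, the sketch below computes a univariate F-statistic from the Pearson correlation between one feature and the target, using plain NumPy rather than scikit-learn's own code. The helper name `f_score` and the toy data are hypothetical. A constant column has zero variance, so its correlation is undefined; that is the likely source of the `corr /= X_norms` warnings in the output, since a one-hot column such as `BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS` is 1 in every row.

```python
import numpy as np

def f_score(x, y):
    """Univariate F-statistic from the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    if denom == 0:
        return float('nan')          # constant column: correlation is undefined
    r = float(xc @ yc) / denom
    return r ** 2 / (1 - r ** 2) * (len(x) - 2)

y = [100, 150, 210, 240, 330]
informative = [1, 2, 3, 4, 5]     # tracks the target closely
constant = [1, 1, 1, 1, 1]        # e.g. a one-hot column with a single category

print(f_score(informative, y))
print(f_score(constant, y))
```

Dropping constant columns before calling `SelectKBest` would be one way to avoid the undefined scores.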

In [0]:
#Seeing how many features minimize the mean absolute error

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns)+1):
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test Mean Absolute Error: ${mae:,.0f} \n')
    

1 features
Test Mean Absolute Error: $185,788 

2 features
Test Mean Absolute Error: $184,748 

3 features
Test Mean Absolute Error: $184,748 

4 features
Test Mean Absolute Error: $185,822 

5 features
Test Mean Absolute Error: $184,054 

6 features
Test Mean Absolute Error: $178,877 

7 features
Test Mean Absolute Error: $179,359 

8 features
Test Mean Absolute Error: $179,041 

9 features
Test Mean Absolute Error: $174,044 

10 features
Test Mean Absolute Error: $165,820 

11 features
Test Mean Absolute Error: $165,486 

12 features
Test Mean Absolute Error: $164,614 

13 features
Test Mean Absolute Error: $166,183 

14 features
Test Mean Absolute Error: $165,316 

15 features
Test Mean Absolute Error: $165,317 

16 features
Test Mean Absolute Error: $165,053 

17 features
Test Mean Absolute Error: $165,053 

18 features
Test Mean Absolute Error: $164,883 

19 features
Test Mean Absolute Error: $164,883 

20 features
Test Mean Absolute Error: $164,922 

21 features
Test Mean Absolute Error: $164,922 

22 features
Test Mean Absolute Error: $160,542 

23 features
Test Mean Absolute Error: $160,542 

24 features
Test Mean Absolute Error: $160,542 

25 features
Test Mean Absolute Error: $160,542 

26 features
Test Mean Absolute Error: $160,277 

27 features
Test Mean Absolute Error: $160,159 

28 features
Test Mean Absolute Error: $160,093 

29 features
Test Mean Absolute Error: $160,051 

30 features
Test Mean Absolute Error: $161,275 

31 features
Test Mean Absolute Error: $161,323 

32 features
Test Mean Absolute Error: $161,329 

33 features
Test Mean Absolute Error: $160,906 

34 features
Test Mean Absolute Error: $160,885 

35 features
Test Mean Absolute Error: $160,937 

36 features
Test Mean Absolute Error: $162,023 

37 features
Test Mean Absolute Error: $161,991 

38 features
Test Mean Absolute Error: $162,051 

39 features
Test Mean Absolute Error: $161,123 

40 features
Test Mean Absolute Error: $161,123 

41 features
Test Mean Absolute Error: $161,118 

42 features
Test Mean Absolute Error: $161,118 

43 features
Test Mean Absolute Error: $161,134 

44 features
Test Mean Absolute Error: $161,115 

45 features
Test Mean Absolute Error: $161,113 

46 features
Test Mean Absolute Error: $161,425 

47 features
Test Mean Absolute Error: $161,425 

48 features
Test Mean Absolute Error: $161,425 

49 features
Test Mean Absolute Error: $161,440 

50 features
Test Mean Absolute Error: $161,426 




# Fit a ridge regression model with multiple features. Use the normalize=True parameter (or do feature scaling beforehand — use the scaler's fit_transform method with the train set, and the scaler's transform method with the test set)
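
The scaler route mentioned in this step can be sketched as follows. Newer scikit-learn releases removed the `normalize` parameter, so this pattern is the one that still works there. The toy arrays are hypothetical, and because `StandardScaler` rescales features differently from `normalize=True`, the `alpha` values of the two approaches may not be directly comparable:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_train_demo = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 2.0], [2500.0, 3.0]])
y_train_demo = np.array([300.0, 450.0, 520.0, 700.0])
X_test_demo = np.array([[1800.0, 2.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the train statistics

model_demo = Ridge(alpha=1.0)
model_demo.fit(X_train_scaled, y_train_demo)
print(model_demo.predict(X_test_scaled))
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.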

In [0]:
features = ['BOROUGH_3', 'BOROUGH_2', 'BOROUGH_5', 'NEIGHBORHOOD_OTHER',
       'NEIGHBORHOOD_FLUSHING-NORTH', 'NEIGHBORHOOD_FOREST HILLS',
       'NEIGHBORHOOD_BOROUGH PARK', 'TAX_CLASS_AT_PRESENT_1', 'BLOCK',
       'BUILDING_CLASS_AT_PRESENT_A5', 'BUILDING_CLASS_AT_PRESENT_A3',
       'BUILDING_CLASS_AT_PRESENT_S1', 'BUILDING_CLASS_AT_PRESENT_A4',
       'BUILDING_CLASS_AT_PRESENT_A6', 'BUILDING_CLASS_AT_PRESENT_S0',
       'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'GROSS_SQUARE_FEET', 'BUILDING_CLASS_AT_TIME_OF_SALE_A5',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A3',
       'BUILDING_CLASS_AT_TIME_OF_SALE_S1',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A4',
       'BUILDING_CLASS_AT_TIME_OF_SALE_A6',
       'BUILDING_CLASS_AT_TIME_OF_SALE_S0']

In [0]:
X_train_selected = X_train[features]
X_test_selected = X_test[features]

In [0]:
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

In [0]:
from sklearn.linear_model import RidgeCV
ridge = RidgeCV(alphas=alphas, normalize=True)
ridge.fit(X_train_selected, y_train)
ridge.alpha_

0.01
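
To see what `alpha` controls, here is a minimal closed-form ridge sketch in NumPy. It is an illustration under simple assumptions (centered data, unpenalized intercept), not scikit-learn's implementation; `ridge_coefficients` and the synthetic data are hypothetical:

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data with known coefficients plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    print(alpha, np.round(coef, 3), round(float(np.linalg.norm(coef)), 3))
```

The coefficient norm shrinks as `alpha` grows. Since the value selected above, 0.01, is the smallest in the grid, it may be worth adding smaller candidates to `alphas`.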

In [0]:
from sklearn.linear_model import Ridge
alpha = ridge.alpha_
model = Ridge(alpha = alpha, normalize = True)

In [0]:
model.fit(X_train_selected, y_train)

Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=True, random_state=None, solver='auto', tol=0.001)

In [0]:
print(model.alpha)
print(model.coef_)
print(model.intercept_)

0.01
[ 5.47078599e+04 -3.26760614e+05 -2.90996174e+05  2.29292678e+05
  3.53749367e+05  5.37744368e+05  4.22546902e+05 -1.02267049e+05
 -2.58047883e+01 -2.34348179e+04  7.58457948e+04 -9.31558680e+02
  3.53100073e+04 -6.76109242e+04  4.91754412e+05 -3.94975151e+01
 -1.02267049e+05  2.23789818e+04 -2.95277768e+04  2.01466167e+02
 -3.16930141e+04  7.58457948e+04 -9.31558680e+02  3.53100073e+04
 -6.76109242e+04  4.91754412e+05]
1045888.2964677641



In [0]:
y_pred = model.predict(X_test_selected)

In [0]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
mae

160142.519549024
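
For reference, mean absolute error is the average of the absolute differences between actual and predicted values. A hand-rolled sketch on made-up numbers (the function name and the prices are hypothetical):

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical sale prices and predictions
actual = [500_000, 750_000, 300_000]
predicted = [450_000, 800_000, 420_000]

print(mean_absolute_error_manual(actual, predicted))
```

Because the errors are not squared, the result is in dollars and can be read as the typical size of a miss.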