Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do a train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # treat '$' as a literal character, not a regex anchor
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
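
The cleaning chain above can be checked on a few values before running it on the full column. This is a minimal sketch: the raw strings below are made up for illustration (the real file's exact formatting is an assumption), and the extra `.str.strip()` is only there to make the whitespace handling explicit.

```python
import pandas as pd

# Hypothetical raw values in the style of the SALE_PRICE column
raw = pd.Series(['$ 550,000', '$ - 0', '$ 1,300,000'])

cleaned = (
    raw
    .str.replace('$', '', regex=False)  # literal dollar sign
    .str.replace('-', '', regex=False)  # dash used for missing or zero amounts
    .str.replace(',', '', regex=False)  # thousands separators
    .str.strip()                        # drop leftover spaces
    .astype(int)
)
print(cleaned.tolist())
```

Passing `regex=False` keeps the behavior the same across pandas versions, since `$` is otherwise a regular-expression anchor.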

In [3]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [4]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [5]:
print(df.shape)
df.head()

(23040, 21)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,...,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,...,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,...,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,...,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,...,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [6]:
# Keep only one-family dwellings.
working = df[df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS']

In [7]:
# Also restrict to sale prices from $100,000 to $2,000,000 (bounds are inclusive here and in the next cell).
working = working[working['SALE_PRICE'] >= 100000]
working.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,...,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,...,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,...,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,...,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,...,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019


In [8]:
working = working[working['SALE_PRICE'] <= 2000000]
print(working.shape)
working.head()

(3164, 21)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,...,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,...,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,...,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,...,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,...,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019
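
The three filtering cells above can also be written as one boolean mask. The sketch below runs on a made-up five-row frame, not the real data, and uses strict inequalities to match the assignment wording ("more than 100 thousand and less than 2 million"), whereas the cells above use inclusive bounds. The `.copy()` is my own addition so that later column assignments on the subset do not trigger pandas' chained-assignment warning.

```python
import pandas as pd

# Hypothetical miniature of the sales data
sales = pd.DataFrame({
    'BUILDING_CLASS_CATEGORY': ['01 ONE FAMILY DWELLINGS'] * 4 + ['21 OFFICE BUILDINGS'],
    'SALE_PRICE': [100000, 550000, 2000000, 0, 810000],
})

# Combine all three conditions; each comparison needs its own parentheses
mask = (
    (sales['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')
    & (sales['SALE_PRICE'] > 100000)
    & (sales['SALE_PRICE'] < 2000000)
)
subset = sales[mask].copy()
print(subset)
```

With strict bounds, the rows priced at exactly 100,000 and 2,000,000 drop out, so the row count on the real data may differ slightly from the shape printed above.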


In [9]:
working['SALE_DATE'] = pd.to_datetime(working['SALE_DATE'])
print(working.dtypes)

BOROUGH                                   object
NEIGHBORHOOD                              object
BUILDING_CLASS_CATEGORY                   object
TAX_CLASS_AT_PRESENT                      object
BLOCK                                      int64
LOT                                        int64
EASE-MENT                                float64
BUILDING_CLASS_AT_PRESENT                 object
ADDRESS                                   object
APARTMENT_NUMBER                          object
ZIP_CODE                                 float64
RESIDENTIAL_UNITS                        float64
COMMERCIAL_UNITS                         float64
TOTAL_UNITS                              float64
LAND_SQUARE_FEET                          object
GROSS_SQUARE_FEET                        float64
YEAR_BUILT                               float64
TAX_CLASS_AT_TIME_OF_SALE                  int64
BUILDING_CLASS_AT_TIME_OF_SALE            object
SALE_PRICE                                 int32
SALE_DATE           

In [10]:
# Split the data into train and test sets: train is January through March 2019, test is April 2019.
train = working[working['SALE_DATE'] <= '2019-03-31']
test = working[working['SALE_DATE'] >= '2019-04-01']
test.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
18235,2,OTHER,01 ONE FAMILY DWELLINGS,1,5913,878,,A1,4616 INDEPENDENCE AVENUE,,...,1.0,0.0,1.0,5000,2272.0,1930.0,1,A1,895000,2019-04-01
18239,2,OTHER,01 ONE FAMILY DWELLINGS,1,5488,48,,A2,558 ELLSWORTH AVENUE,,...,1.0,0.0,1.0,2500,720.0,1935.0,1,A2,253500,2019-04-01
18244,3,OTHER,01 ONE FAMILY DWELLINGS,1,5936,31,,A1,16 BAY RIDGE PARKWAY,,...,1.0,0.0,1.0,2880,2210.0,1925.0,1,A1,1300000,2019-04-01
18280,3,OTHER,01 ONE FAMILY DWELLINGS,1,7813,24,,A5,1247 EAST 40TH STREET,,...,1.0,0.0,1.0,1305,1520.0,1915.0,1,A5,789000,2019-04-01
18285,3,OTHER,01 ONE FAMILY DWELLINGS,1,8831,160,,A9,2314 PLUMB 2ND STREET,,...,1.0,0.0,1.0,1800,840.0,1925.0,1,A9,525000,2019-04-01
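
A date split like the one above can also be driven by a single cutoff, which guarantees that every row lands in exactly one of the two sets. This is a small sketch on made-up dates and prices; the `cutoff` name is my own.

```python
import pandas as pd

# Hypothetical sales with parsed dates
sales = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(['2019-01-01', '2019-03-31', '2019-04-01', '2019-04-30']),
    'SALE_PRICE': [550000, 200000, 895000, 635000],
})

cutoff = pd.Timestamp('2019-04-01')
train = sales[sales['SALE_DATE'] < cutoff]    # January through March
test = sales[sales['SALE_DATE'] >= cutoff]    # April

# No row is lost or counted twice
assert len(train) + len(test) == len(sales)
print(len(train), len(test))
```

Using `<` and `>=` against the same timestamp avoids a gap if a date ever carries a time-of-day component, for example a sale stamped on the afternoon of March 31.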


In [11]:
names = list(working.columns.values)
names

['BOROUGH',
 'NEIGHBORHOOD',
 'BUILDING_CLASS_CATEGORY',
 'TAX_CLASS_AT_PRESENT',
 'BLOCK',
 'LOT',
 'EASE-MENT',
 'BUILDING_CLASS_AT_PRESENT',
 'ADDRESS',
 'APARTMENT_NUMBER',
 'ZIP_CODE',
 'RESIDENTIAL_UNITS',
 'COMMERCIAL_UNITS',
 'TOTAL_UNITS',
 'LAND_SQUARE_FEET',
 'GROSS_SQUARE_FEET',
 'YEAR_BUILT',
 'TAX_CLASS_AT_TIME_OF_SALE',
 'BUILDING_CLASS_AT_TIME_OF_SALE',
 'SALE_PRICE',
 'SALE_DATE']

In [12]:
# The assignment is to one-hot encode the categorical features. BUILDING_CLASS_CATEGORY
# now holds a single value, so encoding it adds nothing. Let's look at the other categoricals:
# NEIGHBORHOOD, EASE-MENT, BUILDING_CLASS (at present and at time of sale), and APARTMENT_NUMBER.
# I can't imagine that apartment number will matter (it also has a lot of NaNs), but let's look at the others.
# EASE-MENT is also problematic, so I'll drop that one.
print(working['NEIGHBORHOOD'].value_counts())
print(working['BOROUGH'].value_counts())
print(working['BUILDING_CLASS_AT_TIME_OF_SALE'].value_counts())
print(working['BUILDING_CLASS_AT_PRESENT'].value_counts())
working

OTHER                 2970
FLUSHING-NORTH          98
EAST NEW YORK           31
FOREST HILLS            22
BOROUGH PARK            19
ASTORIA                 14
BEDFORD STUYVESANT      10
Name: NEIGHBORHOOD, dtype: int64
4    1585
5     740
3     542
2     294
1       3
Name: BOROUGH, dtype: int64
A1    1189
A5     987
A2     496
A9     241
A0      85
S1      48
A3      43
A8      41
A4      17
A6      16
S0       1
Name: BUILDING_CLASS_AT_TIME_OF_SALE, dtype: int64
A1    1188
A5     987
A2     494
A9     241
A0      85
S1      48
A3      43
A8      41
A4      17
A6      16
B2       2
B3       1
S0       1
Name: BUILDING_CLASS_AT_PRESENT, dtype: int64


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,...,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,...,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-01-01
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,...,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,2019-01-02
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,...,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,2019-01-02
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,...,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,2019-01-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,...,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,2019-04-30
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,...,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,2019-04-30
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,...,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,2019-04-30
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,...,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,2019-04-30


In [13]:
target = 'SALE_PRICE'
high_cardinality = ['BUILDING_CLASS_CATEGORY','EASE-MENT','ADDRESS','APARTMENT_NUMBER',
                    'RESIDENTIAL_UNITS','COMMERCIAL_UNITS','BUILDING_CLASS_AT_PRESENT', 
                    'SALE_DATE','BUILDING_CLASS_AT_TIME_OF_SALE','LAND_SQUARE_FEET',
                   'TAX_CLASS_AT_PRESENT']
features = train.columns.drop([target]+high_cardinality)
features

Index(['BOROUGH', 'NEIGHBORHOOD', 'BLOCK', 'LOT', 'ZIP_CODE', 'TOTAL_UNITS',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE'],
      dtype='object')

In [14]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]
X_train

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BLOCK,LOT,ZIP_CODE,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,3,OTHER,5495,801,11230.0,1.0,1325.0,1930.0,1
61,4,OTHER,7918,72,11427.0,1.0,2001.0,1940.0,1
78,2,OTHER,4210,19,10461.0,1.0,2043.0,1925.0,1
108,3,OTHER,5212,69,11226.0,1.0,2680.0,1899.0,1
111,3,OTHER,7930,121,11203.0,1.0,1872.0,1940.0,1
...,...,...,...,...,...,...,...,...,...
18129,5,OTHER,4081,44,10306.0,1.0,921.0,1950.0,1
18130,5,OTHER,2373,201,10314.0,1.0,2128.0,1980.0,1
18132,5,OTHER,1132,42,10302.0,1.0,1807.0,2018.0,1
18134,5,OTHER,3395,37,10305.0,1.0,621.0,1930.0,1


In [15]:
X_test.isnull().sum()

BOROUGH                      0
NEIGHBORHOOD                 0
BLOCK                        0
LOT                          0
ZIP_CODE                     0
TOTAL_UNITS                  0
GROSS_SQUARE_FEET            0
YEAR_BUILT                   0
TAX_CLASS_AT_TIME_OF_SALE    0
dtype: int64

In [16]:
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_train

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BLOCK,LOT,ZIP_CODE,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,1,0,0,0,0,1,0,0,0,0,0,0,5495,801,11230.0,1.0,1325.0,1930.0,1
61,0,1,0,0,0,1,0,0,0,0,0,0,7918,72,11427.0,1.0,2001.0,1940.0,1
78,0,0,1,0,0,1,0,0,0,0,0,0,4210,19,10461.0,1.0,2043.0,1925.0,1
108,1,0,0,0,0,1,0,0,0,0,0,0,5212,69,11226.0,1.0,2680.0,1899.0,1
111,1,0,0,0,0,1,0,0,0,0,0,0,7930,121,11203.0,1.0,1872.0,1940.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18129,0,0,0,1,0,1,0,0,0,0,0,0,4081,44,10306.0,1.0,921.0,1950.0,1
18130,0,0,0,1,0,1,0,0,0,0,0,0,2373,201,10314.0,1.0,2128.0,1980.0,1
18132,0,0,0,1,0,1,0,0,0,0,0,0,1132,42,10302.0,1.0,1807.0,2018.0,1
18134,0,0,0,1,0,1,0,0,0,0,0,0,3395,37,10305.0,1.0,621.0,1930.0,1


In [17]:
X_test = encoder.transform(X_test)

In [32]:
X_test.isnull().sum()

BOROUGH_3                          0
BOROUGH_4                          0
BOROUGH_2                          0
BOROUGH_5                          0
BOROUGH_1                          0
NEIGHBORHOOD_OTHER                 0
NEIGHBORHOOD_FLUSHING-NORTH        0
NEIGHBORHOOD_EAST NEW YORK         0
NEIGHBORHOOD_BEDFORD STUYVESANT    0
NEIGHBORHOOD_FOREST HILLS          0
NEIGHBORHOOD_BOROUGH PARK          0
NEIGHBORHOOD_ASTORIA               0
BLOCK                              0
LOT                                0
ZIP_CODE                           0
TOTAL_UNITS                        0
GROSS_SQUARE_FEET                  0
YEAR_BUILT                         0
TAX_CLASS_AT_TIME_OF_SALE          0
dtype: int64
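
The fit-on-train, transform-on-test pattern used above matters most when the test set contains a category the encoder never saw. The sketch below shows that case with scikit-learn's `OneHotEncoder` rather than `category_encoders` (a swapped-in library, chosen only so the example needs nothing beyond scikit-learn); the tiny frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical borough codes; '1' never appears in the training data
train_cat = pd.DataFrame({'BOROUGH': ['3', '4', '2', '5']})
test_cat = pd.DataFrame({'BOROUGH': ['4', '1']})

# handle_unknown='ignore' encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train_cat).toarray()  # learn categories from train only
test_encoded = encoder.transform(test_cat).toarray()        # reuse them for test

print(encoder.categories_[0])
print(test_encoded)
```

The second test row comes out as all zeros because its borough was not in the training data, and the column count stays the same for both sets.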

In [47]:
#Do feature selection with SelectKBest.
from sklearn.feature_selection import SelectKBest, f_regression
selector = SelectKBest(score_func=f_regression, k=11)
X_train_selected = selector.fit_transform(X_train,y_train)
X_train_selected.shape

  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)


(2517, 11)

In [34]:
X_train_selected

array([[1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1230e+04, 1.0000e+00,
        1.3250e+03],
       [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1427e+04, 1.0000e+00,
        2.0010e+03],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0461e+04, 1.0000e+00,
        2.0430e+03],
       ...,
       [0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0302e+04, 1.0000e+00,
        1.8070e+03],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0305e+04, 1.0000e+00,
        6.2100e+02],
       [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1429e+04, 1.0000e+00,
        1.1630e+03]])

In [35]:
X_test_selected = selector.transform(X_test)
X_test_selected.shape

(647, 11)

In [36]:
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]
print('Features Selected:')
for name in selected_names:
    print(name)
print('\n')
print('Features not Selected:')
for name in unselected_names:
    print(name)

Features Selected:
BOROUGH_3
BOROUGH_2
BOROUGH_5
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_BOROUGH PARK
BLOCK
ZIP_CODE
TOTAL_UNITS
GROSS_SQUARE_FEET


Features not Selected:
BOROUGH_4
BOROUGH_1
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_ASTORIA
LOT
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE
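
To see why `SelectKBest` keeps some columns and drops others, it can help to look at the scores it ranks on. The sketch below uses synthetic data with made-up column names (`signal_a`, `signal_b`, `noise`), not the real features: two columns drive the target and one is pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal_a': rng.normal(size=200),
    'signal_b': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
# Target depends on the two signal columns only
y = 3 * X['signal_a'] - 2 * X['signal_b'] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

# F-scores per feature, highest first
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()
print(scores)
print(selected)
```

`f_regression` scores each feature on its own, by its linear correlation with the target, so it can miss features that only help in combination with others.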


In [43]:
# Before fitting ridge regression, compare test MAE for every value of k,
# using SelectKBest with a plain LinearRegression model.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1,len(X_train.columns)+1):
    print(f'{k} features')
    selector = SelectKBest(score_func=f_regression,k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)
    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test,y_pred)
    print("Test MAE", mae)
    print('\n')

1 features
Test MAE 185788.22229822693


2 features
Test MAE 186656.2341384068


3 features
Test MAE 185097.0897087548


4 features
Test MAE 179646.6901280586


5 features
Test MAE 180190.16092437741


6 features
Test MAE 179774.42337399928


7 features
Test MAE 174929.59665089322


8 features
Test MAE 166934.27703921686


9 features
Test MAE 166404.9316233743


10 features
Test MAE 165527.55857209148


11 features
Test MAE 161236.97260770455


12 features
Test MAE 162143.08124578436


13 features
Test MAE 162137.0486174425


14 features
Test MAE 162133.8021256971


15 features
Test MAE 162160.03455099178


16 features
Test MAE 162111.59267437286


17 features
Test MAE 162111.5926743689


18 features
Test MAE 162111.59267436888


19 features
Test MAE 162111.5926743689

  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)




It looks like 11 features is the magic number: test MAE is lowest there, at about $161,237. I'm not sure what is causing the runtime warnings. I get them when I run this in my personal environment. I'll tinker later; they don't seem to be tainting my results yet.
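
One possible cause of those warnings, which I have not confirmed, is a column with no variation: `TAX_CLASS_AT_TIME_OF_SALE` is 1 in every row displayed above, and the correlation behind `f_regression` is undefined for a constant feature. The sketch below, on a made-up four-row frame, shows one way to drop such columns before feature selection using scikit-learn's `VarianceThreshold`.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: one varies, one is constant
X = pd.DataFrame({
    'GROSS_SQUARE_FEET': [1325.0, 2001.0, 2043.0, 2680.0],
    'TAX_CLASS_AT_TIME_OF_SALE': [1, 1, 1, 1],  # zero variance
})

# Remove features whose variance is zero
vt = VarianceThreshold(threshold=0.0)
vt.fit(X)
kept = X.columns[vt.get_support()].tolist()
X_reduced = X[kept]
print(kept)
```

If the warnings disappear after dropping constant columns from the real `X_train`, that would support this explanation; if not, the cause is something else.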

In [31]:
# Try ridge regression over a range of alpha values.
from IPython.display import display, HTML
from sklearn.linear_model import RidgeCV, Ridge
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use('PS')
for alpha in [0.001, 0.01, .1, 1, 100]:
    display(HTML(f"Ridge Regression: Alpha={alpha}"))
    model = Ridge(alpha=alpha, normalize=True)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    display(HTML(f'Test Mean Absolute Error: ${mae:,.2f}'))
    coefficients = pd.Series(model.coef_,X_train.columns)
    plt.figure(figsize=(16,8))
    coefficients.sort_values().plot.barh(color='grey')
    plt.xlim(-400,700)
    plt.show()
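
The `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the loop above fails on newer versions. The assignment's other option is feature scaling, sketched below with `StandardScaler` on synthetic data (the `price` helper and all numbers are made up for illustration). Standardizing is not numerically identical to `normalize=True`, so the same alpha values will not give the same amount of shrinkage.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features: square footage and year built
X_train = rng.normal(loc=[1500, 1940], scale=[500, 30], size=(300, 2))
X_test = rng.normal(loc=[1500, 1940], scale=[500, 30], size=(100, 2))


def price(X):
    """Made-up linear price rule used only to generate example targets."""
    return 200 * X[:, 0] + 1000 * (X[:, 1] - 1900)


y_train = price(X_train) + rng.normal(scale=20000, size=300)
y_test = price(X_test) + rng.normal(scale=20000, size=100)

# Fit the scaler on train only, then apply the same scaling to test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

results = {}
for alpha in [0.001, 0.01, 0.1, 1, 100]:
    model = Ridge(alpha=alpha)
    model.fit(X_train_scaled, y_train)
    results[alpha] = mean_absolute_error(y_test, model.predict(X_test_scaled))
    print(f'alpha={alpha}: test MAE ${results[alpha]:,.2f}')
```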











In [None]:
# Get mean absolute error for the test set.
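
One way to finish this last step, and to cover the `RidgeCV` and pipeline stretch goals at the same time, is to chain scaling, feature selection, and the model into a single estimator and score it on the test set. This is a sketch on synthetic data, not the notebook's result; the coefficients, `k=3`, and the alpha grid are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical data: three informative features, three irrelevant ones
X_train = rng.normal(size=(400, 6))
X_test = rng.normal(size=(100, 6))
coefs = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0])
y_train = X_train @ coefs + rng.normal(scale=0.5, size=400)
y_test = X_test @ coefs + rng.normal(scale=0.5, size=100)

# Every step is fit on the training data only when pipeline.fit is called
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=3),
    RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 100]),
)
pipeline.fit(X_train, y_train)

mae = mean_absolute_error(y_test, pipeline.predict(X_test))

# Baseline for comparison: always predict the training mean
baseline_mae = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))

print(f'Chosen alpha: {pipeline.named_steps["ridgecv"].alpha_}')
print(f'Test MAE: {mae:.3f} (mean baseline: {baseline_mae:.3f})')
```

On the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the encoding step could be passed to the same pipeline, and comparing against the mean baseline gives a sense of how much the model actually helps.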