<a href="https://colab.research.google.com/github/AnthonyJFeola/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January through March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (deprecated in scikit-learn 1.0 and removed in 1.2), or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand: use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
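
The scaling step in the checklist above can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the arrays, coefficients, and `alpha=1.0` are assumptions, not the NYC sales data): the scaler learns its means and standard deviations from the train set only, and the same transformation is then reused on the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3)) * [1, 100, 10_000]
X_test = rng.normal(size=(50, 3)) * [1, 100, 10_000]
true_coef = np.array([5.0, 0.2, 0.001])
y_train = X_train @ true_coef + rng.normal(scale=0.5, size=200)
y_test = X_test @ true_coef + rng.normal(scale=0.5, size=50)

# Fit the scaler on the train set only, then apply it unchanged to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge penalizes all coefficients equally, so comparable feature scales matter
model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: {mae:.3f}')
```

Calling `fit_transform` on the test set instead would leak test-set statistics into preprocessing, which is why the checklist separates the two methods.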

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
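
Two of the stretch goals above, `RidgeCV` and pipelines, combine naturally. The sketch below is one possible arrangement, run on synthetic data (the data, the candidate `alphas`, and `k=4` are illustrative assumptions): scaling, feature selection, and a cross-validated ridge model are chained so that every step is fit on the training data only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of eight features carry signal (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Candidate regularization strengths; RidgeCV picks one by cross-validation
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=4),
    RidgeCV(alphas=alphas),
)
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
print('Chosen alpha:', pipeline.named_steps['ridgecv'].alpha_)
print('Test R^2:', pipeline.score(X_test, y_test))
```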

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
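df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # regex=False: treat '$' literally, not as an end-of-string anchor
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)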

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [0]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
df.head(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [0]:
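# Keep one-family dwellings sold for more than $100K and less than $2M; .copy() avoids SettingWithCopyWarning on later column assignments
df = df[(df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS') & (df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000)].copy()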

In [0]:
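df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)
# Note: comparing against the string 'n' is True for every row, so this turns
# LAND_SQUARE_FEET into a constant column of 1s rather than a numeric area.
df['LAND_SQUARE_FEET'] = (df['LAND_SQUARE_FEET'] != 'n').astype(int)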

In [0]:
df['SALE_YEAR'] = df['SALE_DATE'].dt.year
df['SALE_MONTH'] = df['SALE_DATE'].dt.month

In [0]:
# Train on January through March sales; test on April sales
train = df[df['SALE_MONTH'].isin([1, 2, 3])].reset_index(drop=True)
test = df[df['SALE_MONTH'] == 4].reset_index(drop=True)
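
Filtering on the month number alone works here because the data covers a single year. As a hedged alternative, the split can be written against explicit date cutoffs, which stays correct if other years are present. The toy frame and the names `sales`, `train_toy`, and `test_toy` below are hypothetical.

```python
import pandas as pd

# Toy stand-in for the sales data (hypothetical rows)
sales = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(
        ['2019-01-15', '2019-02-20', '2019-03-30', '2019-04-01', '2019-04-30']
    ),
    'SALE_PRICE': [550000, 200000, 810000, 895000, 525000],
})

# Train: everything before April 1, 2019. Test: April 2019 only.
cutoff = pd.Timestamp('2019-04-01')
end = pd.Timestamp('2019-05-01')
train_toy = sales[sales['SALE_DATE'] < cutoff]
test_toy = sales[(sales['SALE_DATE'] >= cutoff) & (sales['SALE_DATE'] < end)]

print(len(train_toy), len(test_toy))
```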

In [0]:
from pandas_profiling import ProfileReport

In [0]:
train_profile = ProfileReport(train)
test_profile = ProfileReport(test)

In [13]:
train_profile

Overview: dataset info

0,1
Number of variables,23
Number of observations,2507
Total Missing (%),8.7%
Total size in memory,450.6 KiB
Average record size in memory,184.1 B

Overview: variable types

0,1
Numeric,9
Categorical,7
Boolean,1
Date,1
Text (Unique),0
Rejected,5
Unsupported,0

Variable: BOROUGH (Categorical)

0,1
Distinct count,5
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
4,1204
5,662
3,398
Other values (2),243

Value,Count,Frequency (%),Unnamed: 3
4,1204,48.0%,
5,662,26.4%,
3,398,15.9%,
2,242,9.7%,
1,1,0.0%,

Variable: NEIGHBORHOOD (Categorical)

0,1
Distinct count,6
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
OTHER,2382
FLUSHING-NORTH,77
FOREST HILLS,17
Other values (3),31

Value,Count,Frequency (%),Unnamed: 3
OTHER,2382,95.0%,
FLUSHING-NORTH,77,3.1%,
FOREST HILLS,17,0.7%,
BOROUGH PARK,12,0.5%,
ASTORIA,11,0.4%,
BEDFORD STUYVESANT,8,0.3%,

Variable: BUILDING_CLASS_CATEGORY (Rejected: constant)

0,1
Constant value,01 ONE FAMILY DWELLINGS

Variable: TAX_CLASS_AT_PRESENT (Categorical)

0,1
Distinct count,2
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
1,2476
1D,31

Value,Count,Frequency (%),Unnamed: 3
1,2476,98.8%,
1D,31,1.2%,

Variable: BLOCK (Numeric)

0,1
Distinct count,2060
Unique (%),82.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6758.3
Minimum,21
Maximum,16350
Zeros (%),0.0%

0,1
Minimum,21.0
5-th percentile,764.8
Q1,3837.5
Median,6022.0
Q3,9888.5
95-th percentile,13261.0
Maximum,16350.0
Range,16329.0
Interquartile range,6051.0

0,1
Standard deviation,3975.9
Coef of variation,0.5883
Kurtosis,-0.68943
Mean,6758.3
MAD,3294.4
Skewness,0.35277
Sum,16943068
Variance,15808000
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
16350,17,0.7%,
1272,7,0.3%,
5735,6,0.2%,
5506,5,0.2%,
6022,5,0.2%,
4898,5,0.2%,
3905,5,0.2%,
7008,5,0.2%,
752,4,0.2%,
150,4,0.2%,

Value,Count,Frequency (%),Unnamed: 3
21,1,0.0%,
24,1,0.0%,
54,1,0.0%,
61,1,0.0%,
64,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
16299,1,0.0%,
16305,1,0.0%,
16312,1,0.0%,
16340,4,0.2%,
16350,17,0.7%,

Variable: LOT (Numeric)

0,1
Distinct count,304
Unique (%),12.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,75.778
Minimum,1
Maximum,2720
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,5
Q1,21
Median,42
Q3,70
95-th percentile,248
Maximum,2720
Range,2719
Interquartile range,49

0,1
Standard deviation,157.53
Coef of variation,2.0788
Kurtosis,100.15
Mean,75.778
MAD,65.615
Skewness,8.5019
Sum,189976
Variance,24816
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
20,46,1.8%,
1,45,1.8%,
19,43,1.7%,
7,40,1.6%,
13,38,1.5%,
40,37,1.5%,
10,37,1.5%,
29,36,1.4%,
14,36,1.4%,
26,36,1.4%,

Value,Count,Frequency (%),Unnamed: 3
1,45,1.8%,
2,9,0.4%,
3,26,1.0%,
4,28,1.1%,
5,30,1.2%,

Value,Count,Frequency (%),Unnamed: 3
1560,1,0.0%,
1792,1,0.0%,
2056,1,0.0%,
2686,1,0.0%,
2720,1,0.0%,

Variable: EASE-MENT (Rejected: constant)

0,1
Constant value,

Variable: BUILDING_CLASS_AT_PRESENT (Categorical)

0,1
Distinct count,13
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
A1,919
A5,779
A2,411
Other values (10),398

Value,Count,Frequency (%),Unnamed: 3
A1,919,36.7%,
A5,779,31.1%,
A2,411,16.4%,
A9,193,7.7%,
A0,67,2.7%,
S1,39,1.6%,
A3,38,1.5%,
A8,31,1.2%,
A6,14,0.6%,
A4,13,0.5%,

Variable: ADDRESS (Categorical)

0,1
Distinct count,2497
Unique (%),99.6%
Missing (%),0.0%
Missing (n),0

0,1
294 FREEBORN STREET,2
130-52 LEFFERTS BOULEVARD,2
216-29 114TH ROAD,2
Other values (2494),2501

Value,Count,Frequency (%),Unnamed: 3
294 FREEBORN STREET,2,0.1%,
130-52 LEFFERTS BOULEVARD,2,0.1%,
216-29 114TH ROAD,2,0.1%,
125-27 LUCAS STREET,2,0.1%,
118-20 202ND STREET,2,0.1%,
104-18 187TH STREET,2,0.1%,
57 CHESTNUT STREET,2,0.1%,
33 BAILEY PLACE,2,0.1%,
117-45 125TH STREET,2,0.1%,
22-40 93RD STREET,2,0.1%,

Variable: APARTMENT_NUMBER (Categorical)

0,1
Distinct count,2
Unique (%),0.1%
Missing (%),100.0%
Missing (n),2506

0,1
RP.,1
(Missing),2506

Value,Count,Frequency (%),Unnamed: 3
RP.,1,0.0%,
(Missing),2506,100.0%,

Variable: ZIP_CODE (Numeric)

0,1
Distinct count,122
Unique (%),4.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10993
Minimum,10301
Maximum,11697
Zeros (%),0.0%

0,1
Minimum,10301
5-th percentile,10304
Q1,10314
Median,11234
Q3,11413
95-th percentile,11434
Maximum,11697
Range,1396
Interquartile range,1099

0,1
Standard deviation,494.29
Coef of variation,0.044963
Kurtosis,-1.5846
Mean,10993
MAD,464.36
Skewness,-0.49029
Sum,27560000
Variance,244320
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
10312.0,115,4.6%,
10306.0,113,4.5%,
10314.0,95,3.8%,
11434.0,77,3.1%,
11413.0,69,2.8%,
11234.0,66,2.6%,
11412.0,65,2.6%,
10304.0,61,2.4%,
10305.0,55,2.2%,
10465.0,52,2.1%,

Value,Count,Frequency (%),Unnamed: 3
10301.0,22,0.9%,
10302.0,29,1.2%,
10303.0,40,1.6%,
10304.0,61,2.4%,
10305.0,55,2.2%,

Value,Count,Frequency (%),Unnamed: 3
11691.0,17,0.7%,
11692.0,6,0.2%,
11693.0,6,0.2%,
11694.0,10,0.4%,
11697.0,21,0.8%,

Variable: RESIDENTIAL_UNITS (Boolean)

0,1
Distinct count,2
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.98763

0,1
1.0,2476
0.0,31

Value,Count,Frequency (%),Unnamed: 3
1.0,2476,98.8%,
0.0,31,1.2%,

Variable: COMMERCIAL_UNITS (Numeric)

0,1
Distinct count,3
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.016354
Minimum,0
Maximum,2
Zeros (%),98.4%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,0
Maximum,2
Range,2
Interquartile range,0

0,1
Standard deviation,0.12997
Coef of variation,7.947
Kurtosis,69.89
Mean,0.016354
MAD,0.032187
Skewness,8.1703
Sum,41
Variance,0.016891
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,2467,98.4%,
1.0,39,1.6%,
2.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2467,98.4%,
1.0,39,1.6%,
2.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2467,98.4%,
1.0,39,1.6%,
2.0,1,0.0%,

Variable: TOTAL_UNITS (Numeric)

0,1
Distinct count,4
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.004
Minimum,0
Maximum,3
Zeros (%),1.2%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,1
Maximum,3
Range,3
Interquartile range,0

0,1
Standard deviation,0.17179
Coef of variation,0.17111
Kurtosis,36.376
Mean,1.004
MAD,0.032581
Skewness,1.1905
Sum,2517
Variance,0.029513
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,2436,97.2%,
2.0,39,1.6%,
0.0,31,1.2%,
3.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,31,1.2%,
1.0,2436,97.2%,
2.0,39,1.6%,
3.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,31,1.2%,
1.0,2436,97.2%,
2.0,39,1.6%,
3.0,1,0.0%,

Variable: LAND_SQUARE_FEET (Rejected: constant)

0,1
Constant value,1

Variable: GROSS_SQUARE_FEET (Numeric)

0,1
Distinct count,922
Unique (%),36.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1473.7
Minimum,0
Maximum,7875
Zeros (%),1.2%

0,1
Minimum,0.0
5-th percentile,816.0
Q1,1144.0
Median,1368.0
Q3,1683.0
95-th percentile,2481.6
Maximum,7875.0
Range,7875.0
Interquartile range,539.0

0,1
Standard deviation,599.22
Coef of variation,0.4066
Kurtosis,16.858
Mean,1473.7
MAD,400.61
Skewness,2.5181
Sum,3694700
Variance,359060
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
1440.0,41,1.6%,
1280.0,32,1.3%,
1296.0,31,1.2%,
1200.0,31,1.2%,
0.0,31,1.2%,
960.0,26,1.0%,
1120.0,25,1.0%,
1400.0,23,0.9%,
1224.0,23,0.9%,
1600.0,23,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,31,1.2%,
375.0,1,0.0%,
425.0,1,0.0%,
436.0,1,0.0%,
448.0,2,0.1%,

Value,Count,Frequency (%),Unnamed: 3
5184.0,1,0.0%,
5348.0,1,0.0%,
7200.0,1,0.0%,
7500.0,1,0.0%,
7875.0,1,0.0%,

Variable: YEAR_BUILT (Numeric)

0,1
Distinct count,86
Unique (%),3.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1944.8
Minimum,1890
Maximum,2018
Zeros (%),0.0%

0,1
Minimum,1890
5-th percentile,1910
Q1,1925
Median,1940
Q3,1960
95-th percentile,2002
Maximum,2018
Range,128
Interquartile range,35

0,1
Standard deviation,27.059
Coef of variation,0.013914
Kurtosis,0.2273
Mean,1944.8
MAD,21.35
Skewness,0.89025
Sum,4875500
Variance,732.21
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
1920.0,290,11.6%,
1925.0,285,11.4%,
1930.0,246,9.8%,
1950.0,215,8.6%,
1940.0,192,7.7%,
1945.0,145,5.8%,
1960.0,124,4.9%,
1935.0,120,4.8%,
1955.0,108,4.3%,
1910.0,78,3.1%,

Value,Count,Frequency (%),Unnamed: 3
1890.0,1,0.0%,
1899.0,38,1.5%,
1900.0,2,0.1%,
1901.0,25,1.0%,
1905.0,9,0.4%,

Value,Count,Frequency (%),Unnamed: 3
2014.0,1,0.0%,
2015.0,2,0.1%,
2016.0,7,0.3%,
2017.0,13,0.5%,
2018.0,36,1.4%,

Variable: TAX_CLASS_AT_TIME_OF_SALE (Rejected: constant)

0,1
Constant value,1

Variable: BUILDING_CLASS_AT_TIME_OF_SALE (Categorical)

0,1
Distinct count,11
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0

0,1
A1,919
A5,779
A2,413
Other values (8),396

Value,Count,Frequency (%),Unnamed: 3
A1,919,36.7%,
A5,779,31.1%,
A2,413,16.5%,
A9,193,7.7%,
A0,67,2.7%,
S1,39,1.6%,
A3,38,1.5%,
A8,31,1.2%,
A6,14,0.6%,
A4,13,0.5%,

Variable: SALE_PRICE (Numeric)

0,1
Distinct count,880
Unique (%),35.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,621570
Minimum,104000
Maximum,1955000
Zeros (%),0.0%

0,1
Minimum,104000
5-th percentile,250000
Q1,440500
Median,560000
Q3,750000
95-th percentile,1194000
Maximum,1955000
Range,1851000
Interquartile range,309500

0,1
Standard deviation,291610
Coef of variation,0.46914
Kurtosis,2.778
Mean,621570
MAD,214720
Skewness,1.3854
Sum,1558285372
Variance,85035000000
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
500000,37,1.5%,
550000,36,1.4%,
450000,33,1.3%,
525000,31,1.2%,
600000,29,1.2%,
400000,26,1.0%,
650000,26,1.0%,
700000,26,1.0%,
490000,23,0.9%,
800000,23,0.9%,

Value,Count,Frequency (%),Unnamed: 3
104000,1,0.0%,
105000,1,0.0%,
108000,1,0.0%,
110000,1,0.0%,
112000,2,0.1%,

Value,Count,Frequency (%),Unnamed: 3
1876000,1,0.0%,
1900000,1,0.0%,
1925000,1,0.0%,
1950000,1,0.0%,
1955000,1,0.0%,

Variable: SALE_DATE (Date)

0,1
Distinct count,68
Unique (%),2.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2019-01-01 00:00:00
Maximum,2019-03-30 00:00:00

Variable: SALE_YEAR (Rejected: constant)

0,1
Constant value,2019

Variable: SALE_MONTH (Numeric)

0,1
Distinct count,3
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.941
Minimum,1
Maximum,3
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,2
Q3,3
95-th percentile,3
Maximum,3
Range,2
Interquartile range,2

0,1
Standard deviation,0.83261
Coef of variation,0.42897
Kurtosis,-1.5491
Mean,1.941
MAD,0.71088
Skewness,0.11084
Sum,4866
Variance,0.69324
Memory size,19.7 KiB

Value,Count,Frequency (%),Unnamed: 3
1,947,37.8%,
3,799,31.9%,
2,761,30.4%,

Value,Count,Frequency (%),Unnamed: 3
1,947,37.8%,
2,761,30.4%,
3,799,31.9%,

Value,Count,Frequency (%),Unnamed: 3
1,947,37.8%,
2,761,30.4%,
3,799,31.9%,

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE,SALE_YEAR,SALE_MONTH
0,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,1,1325.0,1930.0,1,A9,550000,2019-01-01,2019,1
1,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,1,2001.0,1940.0,1,A1,200000,2019-01-01,2019,1
2,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,1,2043.0,1925.0,1,A1,810000,2019-01-02,2019,1
3,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,1,2680.0,1899.0,1,A1,125000,2019-01-02,2019,1
4,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1,1872.0,1940.0,1,A5,620000,2019-01-02,2019,1


In [14]:
test_profile

Overview: dataset info

0,1
Number of variables,23
Number of observations,644
Total Missing (%),8.7%
Total size in memory,115.8 KiB
Average record size in memory,184.2 B

Overview: variable types

0,1
Numeric,8
Categorical,6
Boolean,1
Date,1
Text (Unique),0
Rejected,7
Unsupported,0

Variable: BOROUGH (Categorical)

0,1
Distinct count,5
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0

0,1
4,376
3,139
5,76
Other values (2),53

Value,Count,Frequency (%),Unnamed: 3
4,376,58.4%,
3,139,21.6%,
5,76,11.8%,
2,51,7.9%,
1,2,0.3%,

Variable: NEIGHBORHOOD (Categorical)

0,1
Distinct count,6
Unique (%),0.9%
Missing (%),0.0%
Missing (n),0

0,1
OTHER,608
FLUSHING-NORTH,20
BOROUGH PARK,7
Other values (3),9

Value,Count,Frequency (%),Unnamed: 3
OTHER,608,94.4%,
FLUSHING-NORTH,20,3.1%,
BOROUGH PARK,7,1.1%,
FOREST HILLS,5,0.8%,
ASTORIA,3,0.5%,
BEDFORD STUYVESANT,1,0.2%,

Variable: BUILDING_CLASS_CATEGORY (Rejected: constant)

0,1
Constant value,01 ONE FAMILY DWELLINGS

Variable: TAX_CLASS_AT_PRESENT (Categorical)

0,1
Distinct count,2
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0

0,1
1,635
1D,9

Value,Count,Frequency (%),Unnamed: 3
1,635,98.6%,
1D,9,1.4%,

Variable: BLOCK (Numeric)

0,1
Distinct count,610
Unique (%),94.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7539.6
Minimum,107
Maximum,16350
Zeros (%),0.0%

0,1
Minimum,107.0
5-th percentile,1255.9
Q1,4500.0
Median,7393.0
Q3,10858.0
95-th percentile,13504.0
Maximum,16350.0
Range,16243.0
Interquartile range,6357.5

0,1
Standard deviation,3854.4
Coef of variation,0.51123
Kurtosis,-0.78097
Mean,7539.6
MAD,3227.7
Skewness,0.12641
Sum,4855476
Variance,14857000
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
16350,4,0.6%,
7895,3,0.5%,
1331,3,0.5%,
11044,2,0.3%,
3110,2,0.3%,
728,2,0.3%,
1652,2,0.3%,
3309,2,0.3%,
1991,2,0.3%,
4452,2,0.3%,

Value,Count,Frequency (%),Unnamed: 3
107,1,0.2%,
223,2,0.3%,
231,1,0.2%,
258,1,0.2%,
275,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
16050,1,0.2%,
16212,1,0.2%,
16243,1,0.2%,
16340,2,0.3%,
16350,4,0.6%,

Variable: LOT (Numeric)

0,1
Distinct count,172
Unique (%),26.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,76.773
Minimum,1
Maximum,2202
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,5.0
Q1,21.0
Median,41.0
Q3,68.0
95-th percentile,241.55
Maximum,2202.0
Range,2201.0
Interquartile range,47.0

0,1
Standard deviation,174.38
Coef of variation,2.2714
Kurtosis,87.082
Mean,76.773
MAD,68.263
Skewness,8.4769
Sum,49442
Variance,30408
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
31,16,2.5%,
19,13,2.0%,
50,13,2.0%,
38,13,2.0%,
28,12,1.9%,
20,12,1.9%,
30,12,1.9%,
9,11,1.7%,
27,11,1.7%,
24,11,1.7%,

Value,Count,Frequency (%),Unnamed: 3
1,8,1.2%,
2,2,0.3%,
3,8,1.2%,
4,10,1.6%,
5,7,1.1%,

Value,Count,Frequency (%),Unnamed: 3
1202,1,0.2%,
1214,1,0.2%,
2008,1,0.2%,
2022,1,0.2%,
2202,1,0.2%,

Variable: EASE-MENT (Rejected: constant)

0,1
Constant value,

Variable: BUILDING_CLASS_AT_PRESENT (Categorical)

0,1
Distinct count,11
Unique (%),1.7%
Missing (%),0.0%
Missing (n),0

0,1
A1,266
A5,206
A2,80
Other values (8),92

Value,Count,Frequency (%),Unnamed: 3
A1,266,41.3%,
A5,206,32.0%,
A2,80,12.4%,
A9,46,7.1%,
A0,18,2.8%,
A8,9,1.4%,
S1,9,1.4%,
A3,5,0.8%,
A4,3,0.5%,
B2,1,0.2%,

Variable: ADDRESS (Categorical)

0,1
Distinct count,643
Unique (%),99.8%
Missing (%),0.0%
Missing (n),0

0,1
46-12 30TH ROAD,2
158-20 81ST STREET,1
914 E 19TH STREET,1
Other values (640),640

Value,Count,Frequency (%),Unnamed: 3
46-12 30TH ROAD,2,0.3%,
158-20 81ST STREET,1,0.2%,
914 E 19TH STREET,1,0.2%,
120-29 171ST STREET,1,0.2%,
115-93 227TH STREET,1,0.2%,
102 DALE AVENUE,1,0.2%,
89-30 121 STREET,1,0.2%,
183-01 HENDERSON AVE,1,0.2%,
144-31 70TH ROAD,1,0.2%,
53 MC VEIGH AVENUE,1,0.2%,

Variable: APARTMENT_NUMBER (Rejected: constant)

0,1
Constant value,

Variable: ZIP_CODE (Numeric)

0,1
Distinct count,108
Unique (%),16.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,11159
Minimum,10030
Maximum,11697
Zeros (%),0.0%

0,1
Minimum,10030
5-th percentile,10306
Q1,11208
Median,11362
Q3,11419
95-th percentile,11435
Maximum,11697
Range,1667
Interquartile range,211

0,1
Standard deviation,410.16
Coef of variation,0.036757
Kurtosis,0.10852
Mean,11159
MAD,320.34
Skewness,-1.3024
Sum,7186300
Variance,168230
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
11234.0,24,3.7%,
11412.0,24,3.7%,
11434.0,23,3.6%,
11420.0,21,3.3%,
10314.0,18,2.8%,
11229.0,17,2.6%,
11413.0,17,2.6%,
10306.0,14,2.2%,
11433.0,14,2.2%,
11423.0,12,1.9%,

Value,Count,Frequency (%),Unnamed: 3
10030.0,1,0.2%,
10301.0,2,0.3%,
10302.0,3,0.5%,
10303.0,3,0.5%,
10304.0,4,0.6%,

Value,Count,Frequency (%),Unnamed: 3
11691.0,6,0.9%,
11692.0,2,0.3%,
11693.0,2,0.3%,
11694.0,2,0.3%,
11697.0,6,0.9%,

Variable: RESIDENTIAL_UNITS (Numeric)

0,1
Distinct count,3
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.98758
Minimum,0
Maximum,2
Zeros (%),1.4%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,1
Maximum,2
Range,2
Interquartile range,0

0,1
Standard deviation,0.12409
Coef of variation,0.12565
Kurtosis,60.627
Mean,0.98758
MAD,0.027603
Skewness,-6.2298
Sum,636
Variance,0.015398
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,634,98.4%,
0.0,9,1.4%,
2.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,9,1.4%,
1.0,634,98.4%,
2.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,9,1.4%,
1.0,634,98.4%,
2.0,1,0.2%,

Variable: COMMERCIAL_UNITS (Boolean)

0,1
Distinct count,2
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.013975

0,1
0.0,635
1.0,9

Value,Count,Frequency (%),Unnamed: 3
0.0,635,98.6%,
1.0,9,1.4%,

Variable: TOTAL_UNITS (Numeric)

0,1
Distinct count,3
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.0016
Minimum,0
Maximum,2
Zeros (%),1.4%

0,1
Minimum,0
5-th percentile,1
Q1,1
Median,1
Q3,1
95-th percentile,1
Maximum,2
Range,2
Interquartile range,0

0,1
Standard deviation,0.17189
Coef of variation,0.17162
Kurtosis,31.14
Mean,1.0016
MAD,0.031008
Skewness,0.27998
Sum,645
Variance,0.029547
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,625,97.0%,
2.0,10,1.6%,
0.0,9,1.4%,

Value,Count,Frequency (%),Unnamed: 3
0.0,9,1.4%,
1.0,625,97.0%,
2.0,10,1.6%,

Value,Count,Frequency (%),Unnamed: 3
0.0,9,1.4%,
1.0,625,97.0%,
2.0,10,1.6%,

Variable: LAND_SQUARE_FEET (Rejected: constant)

0,1
Constant value,1

Variable: GROSS_SQUARE_FEET (Numeric)

0,1
Distinct count,401
Unique (%),62.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1456.9
Minimum,0
Maximum,4026
Zeros (%),1.4%

0,1
Minimum,0.0
5-th percentile,850.15
Q1,1150.8
Median,1344.0
Q3,1685.5
95-th percentile,2439.7
Maximum,4026.0
Range,4026.0
Interquartile range,534.75

0,1
Standard deviation,533.48
Coef of variation,0.36617
Kurtosis,2.923
Mean,1456.9
MAD,388.67
Skewness,1.0378
Sum,938260
Variance,284600
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
1224.0,10,1.6%,
1216.0,10,1.6%,
0.0,9,1.4%,
1296.0,9,1.4%,
1344.0,8,1.2%,
1520.0,7,1.1%,
1280.0,7,1.1%,
1152.0,6,0.9%,
1080.0,6,0.9%,
1568.0,6,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,9,1.4%,
280.0,1,0.2%,
350.0,1,0.2%,
448.0,1,0.2%,
500.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
3416.0,1,0.2%,
3426.0,1,0.2%,
3600.0,1,0.2%,
3690.0,1,0.2%,
4026.0,1,0.2%,

Variable: YEAR_BUILT (Numeric)

0,1
Distinct count,51
Unique (%),7.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1939.5
Minimum,1899
Maximum,2018
Zeros (%),0.0%

0,1
Minimum,1899
5-th percentile,1910
Q1,1925
Median,1930
Q3,1950
95-th percentile,1996
Maximum,2018
Range,119
Interquartile range,25

0,1
Standard deviation,24.713
Coef of variation,0.012742
Kurtosis,1.6028
Mean,1939.5
MAD,18.56
Skewness,1.3288
Sum,1249100
Variance,610.72
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
1930.0,93,14.4%,
1920.0,92,14.3%,
1925.0,89,13.8%,
1950.0,50,7.8%,
1940.0,50,7.8%,
1935.0,42,6.5%,
1960.0,26,4.0%,
1910.0,24,3.7%,
1955.0,23,3.6%,
1945.0,22,3.4%,

Value,Count,Frequency (%),Unnamed: 3
1899.0,6,0.9%,
1901.0,8,1.2%,
1905.0,5,0.8%,
1910.0,24,3.7%,
1915.0,13,2.0%,

Value,Count,Frequency (%),Unnamed: 3
2005.0,1,0.2%,
2006.0,3,0.5%,
2016.0,3,0.5%,
2017.0,5,0.8%,
2018.0,7,1.1%,

Variable: TAX_CLASS_AT_TIME_OF_SALE (Rejected: constant)

0,1
Constant value,1

Variable: BUILDING_CLASS_AT_TIME_OF_SALE (Categorical)

0,1
Distinct count,10
Unique (%),1.6%
Missing (%),0.0%
Missing (n),0

0,1
A1,267
A5,206
A2,80
Other values (7),91

Value,Count,Frequency (%),Unnamed: 3
A1,267,41.5%,
A5,206,32.0%,
A2,80,12.4%,
A9,46,7.1%,
A0,18,2.8%,
A8,9,1.4%,
S1,9,1.4%,
A3,5,0.8%,
A4,3,0.5%,
A6,1,0.2%,

Variable: SALE_PRICE (Numeric)

0,1
Distinct count,334
Unique (%),51.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,655760
Minimum,107500
Maximum,1912500
Zeros (%),0.0%

0,1
Minimum,107500
5-th percentile,295000
Q1,470000
Median,599300
Q3,781620
95-th percentile,1261300
Maximum,1912500
Range,1805000
Interquartile range,311620

0,1
Standard deviation,296980
Coef of variation,0.45288
Kurtosis,3.0789
Mean,655760
MAD,216600
Skewness,1.4527
Sum,422307543
Variance,88197000000
Memory size,5.2 KiB

Value,Count,Frequency (%),Unnamed: 3
625000,11,1.7%,
500000,11,1.7%,
650000,10,1.6%,
850000,9,1.4%,
700000,9,1.4%,
525000,9,1.4%,
645000,8,1.2%,
520000,8,1.2%,
530000,7,1.1%,
400000,7,1.1%,

Value,Count,Frequency (%),Unnamed: 3
107500,1,0.2%,
110000,1,0.2%,
125000,1,0.2%,
150000,1,0.2%,
158000,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1795000,1,0.2%,
1800000,3,0.5%,
1900000,1,0.2%,
1909219,1,0.2%,
1912500,1,0.2%,

Variable: SALE_DATE (Date)

0,1
Distinct count,23
Unique (%),3.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2019-04-01 00:00:00
Maximum,2019-04-30 00:00:00

Variable: SALE_YEAR (Rejected: constant)

0,1
Constant value,2019

Variable: SALE_MONTH (Rejected: constant)

0,1
Constant value,4

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE,SALE_YEAR,SALE_MONTH
0,2,OTHER,01 ONE FAMILY DWELLINGS,1,5913,878,,A1,4616 INDEPENDENCE AVENUE,,10471.0,1.0,0.0,1.0,1,2272.0,1930.0,1,A1,895000,2019-04-01,2019,4
1,2,OTHER,01 ONE FAMILY DWELLINGS,1,5488,48,,A2,558 ELLSWORTH AVENUE,,10465.0,1.0,0.0,1.0,1,720.0,1935.0,1,A2,253500,2019-04-01,2019,4
2,3,OTHER,01 ONE FAMILY DWELLINGS,1,5936,31,,A1,16 BAY RIDGE PARKWAY,,11209.0,1.0,0.0,1.0,1,2210.0,1925.0,1,A1,1300000,2019-04-01,2019,4
3,3,OTHER,01 ONE FAMILY DWELLINGS,1,7813,24,,A5,1247 EAST 40TH STREET,,11210.0,1.0,0.0,1.0,1,1520.0,1915.0,1,A5,789000,2019-04-01,2019,4
4,3,OTHER,01 ONE FAMILY DWELLINGS,1,8831,160,,A9,2314 PLUMB 2ND STREET,,11229.0,1.0,0.0,1.0,1,840.0,1925.0,1,A9,525000,2019-04-01,2019,4


In [15]:
train.describe()

Unnamed: 0,BLOCK,LOT,EASE-MENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_YEAR,SALE_MONTH
count,2507.0,2507.0,0.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0
mean,6758.303949,75.778221,,10993.398484,0.987635,0.016354,1.003989,1.0,1473.744715,1944.766653,1.0,621573.7,2019.0,1.940965
std,3975.909029,157.531138,,494.291462,0.110532,0.129966,0.171794,0.0,599.217635,27.059337,0.0,291607.2,0.0,0.832611
min,21.0,1.0,,10301.0,0.0,0.0,0.0,1.0,0.0,1890.0,1.0,104000.0,2019.0,1.0
25%,3837.5,21.0,,10314.0,1.0,0.0,1.0,1.0,1144.0,1925.0,1.0,440500.0,2019.0,1.0
50%,6022.0,42.0,,11234.0,1.0,0.0,1.0,1.0,1368.0,1940.0,1.0,560000.0,2019.0,2.0
75%,9888.5,70.0,,11413.0,1.0,0.0,1.0,1.0,1683.0,1960.0,1.0,750000.0,2019.0,3.0
max,16350.0,2720.0,,11697.0,1.0,2.0,3.0,1.0,7875.0,2018.0,1.0,1955000.0,2019.0,3.0


In [16]:
train.describe(exclude='number')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,2507.0,2507,2507,2507.0,2507,2507,1,2507,2507
unique,5.0,6,1,2.0,13,2497,1,11,68
top,4.0,OTHER,01 ONE FAMILY DWELLINGS,1.0,A1,294 FREEBORN STREET,RP.,A1,2019-01-31 00:00:00
freq,1204.0,2382,2507,2476.0,919,2,1,919,78
first,,,,,,,,,2019-01-01 00:00:00
last,,,,,,,,,2019-03-30 00:00:00


In [17]:
target = 'SALE_PRICE'
high_cardinality_or_drop = ['BUILDING_CLASS_CATEGORY', 'TAX_CLASS_AT_PRESENT', 
                            'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 
                            'APARTMENT_NUMBER', 'EASE-MENT', 'COMMERCIAL_UNITS', 
                            'TAX_CLASS_AT_TIME_OF_SALE', 'SALE_YEAR', 'SALE_MONTH', 
                            'SALE_DATE', 'BUILDING_CLASS_AT_TIME_OF_SALE' ]
features = train.columns.drop([target] + high_cardinality_or_drop)
features

Index(['BOROUGH', 'NEIGHBORHOOD', 'BLOCK', 'LOT', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT'],
      dtype='object')

In [0]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [19]:
X_train.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT
0,3,OTHER,5495,801,11230.0,1.0,1.0,1,1325.0,1930.0
1,4,OTHER,7918,72,11427.0,1.0,1.0,1,2001.0,1940.0
2,2,OTHER,4210,19,10461.0,1.0,1.0,1,2043.0,1925.0
3,3,OTHER,5212,69,11226.0,1.0,1.0,1,2680.0,1899.0
4,3,OTHER,7930,121,11203.0,1.0,1.0,1,1872.0,1940.0


In [0]:
import category_encoders as ce

# One-hot encode the categorical columns; use_cat_names=True names columns like BOROUGH_3
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

In [21]:
X_test_enc

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT
0,0,0,1,0,0,1,0,0,0,0,0,5913,878,10471.0,1.0,1.0,1,2272.0,1930.0
1,0,0,1,0,0,1,0,0,0,0,0,5488,48,10465.0,1.0,1.0,1,720.0,1935.0
2,1,0,0,0,0,1,0,0,0,0,0,5936,31,11209.0,1.0,1.0,1,2210.0,1925.0
3,1,0,0,0,0,1,0,0,0,0,0,7813,24,11210.0,1.0,1.0,1,1520.0,1915.0
4,1,0,0,0,0,1,0,0,0,0,0,8831,160,11229.0,1.0,1.0,1,840.0,1925.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
639,0,1,0,0,0,1,0,0,0,0,0,13215,3,11422.0,1.0,1.0,1,1478.0,1925.0
640,0,1,0,0,0,1,0,0,0,0,0,11612,73,11420.0,1.0,1.0,1,1280.0,1930.0
641,0,1,0,0,0,1,0,0,0,0,0,11808,50,11420.0,1.0,1.0,1,1333.0,1945.0
642,0,1,0,0,0,1,0,0,0,0,0,12295,23,11434.0,1.0,1.0,1,1020.0,1935.0


In [0]:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)
X_train_kbest = selector.fit_transform(X_train_enc, y_train)
X_test_kbest = selector.transform(X_test_enc)

In [23]:
X_train_kbest.shape, X_test_kbest.shape

((2507, 10), (644, 10))

In [24]:
X_train_kbest

array([[1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1230e+04, 1.0000e+00,
        1.3250e+03],
       [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1427e+04, 1.0000e+00,
        2.0010e+03],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0461e+04, 1.0000e+00,
        2.0430e+03],
       ...,
       [0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0302e+04, 1.0000e+00,
        1.8070e+03],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0305e+04, 1.0000e+00,
        6.2100e+02],
       [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.1429e+04, 1.0000e+00,
        1.1630e+03]])

In [25]:
mask = selector.get_support()
mask

array([ True, False,  True,  True, False,  True,  True, False,  True,
       False, False,  True, False,  True, False,  True, False,  True,
       False])

In [26]:
X_train_enc.columns[mask]

Index(['BOROUGH_3', 'BOROUGH_2', 'BOROUGH_5', 'NEIGHBORHOOD_OTHER',
       'NEIGHBORHOOD_FLUSHING-NORTH', 'NEIGHBORHOOD_FOREST HILLS', 'BLOCK',
       'ZIP_CODE', 'TOTAL_UNITS', 'GROSS_SQUARE_FEET'],
      dtype='object')

In [27]:
X_train_enc.columns[~mask]

Index(['BOROUGH_4', 'BOROUGH_1', 'NEIGHBORHOOD_BEDFORD STUYVESANT',
       'NEIGHBORHOOD_BOROUGH PARK', 'NEIGHBORHOOD_ASTORIA', 'LOT',
       'RESIDENTIAL_UNITS', 'LAND_SQUARE_FEET', 'YEAR_BUILT'],
      dtype='object')
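
The boolean mask from `get_support()` says which columns were kept, but not why. A fitted `SelectKBest` also exposes the per-feature scores in its `scores_` attribute. The sketch below shows one way to rank features by their `f_regression` score, using a small synthetic frame (the column names and data are hypothetical, not the encoded NYC features).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic frame: two informative columns and two pure-noise columns (hypothetical)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 4)),
    columns=['strong', 'weak', 'noise_a', 'noise_b'],
)
y_demo = 10 * X_demo['strong'] + 3 * X_demo['weak'] + rng.normal(size=500)

selector_demo = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Pair each univariate F-score with its column name, highest first
scores = pd.Series(selector_demo.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)

# The k highest-scoring columns are the ones get_support() marks as True
selected = list(X_demo.columns[selector_demo.get_support()])
print(selected)
```

Because `f_regression` scores each feature on its own, correlated features can all score highly even when they carry the same information.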

In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Try every possible number of selected features and record the test MAE for each
mae_list = []
for k in range(1, X_train_enc.shape[1] + 1):
    print(f'{k} features')
    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_kbest = selector.fit_transform(X_train_enc, y_train)
    X_test_kbest = selector.transform(X_test_enc)
    model = LinearRegression()
    model.fit(X_train_kbest, y_train)
    y_pred = model.predict(X_test_kbest)
    mae = mean_absolute_error(y_test, y_pred)  # scikit-learn convention: y_true first
    print(f'MAE on test set: ${mae:.2f}')
    mae_list.append(mae)

1 features
MAE on test set: $183640.59
2 features
MAE on test set: $174495.92
3 features
MAE on test set: $175142.68
4 features
MAE on test set: $173620.37
5 features
MAE on test set: $174228.72
6 features
MAE on test set: $174011.33
7 features
MAE on test set: $169628.18
8 features
MAE on test set: $170425.59
9 features
MAE on test set: $169744.84
10 features
MAE on test set: $162186.98
11 features
MAE on test set: $160380.14
12 features
MAE on test set: $160405.90
13 features
MAE on test set: $161706.15
14 features
MAE on test set: $162274.12
15 features
MAE on test set: $162294.29
16 features
MAE on test set: $162310.21
17 features
MAE on test set: $162310.21
18 features
MAE on test set: $162310.21
19 features
MAE on test set: $162310.21
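
The loop above collects one test MAE per value of `k`. A short sketch of how the best `k` can be read off that list, using the values printed in the output above (the variable name `mae_by_k` is introduced here for illustration):

```python
# Test MAE for k = 1..19, copied from the printed output
mae_by_k = [
    183640.59, 174495.92, 175142.68, 173620.37, 174228.72,
    174011.33, 169628.18, 170425.59, 169744.84, 162186.98,
    160380.14, 160405.90, 161706.15, 162274.12, 162294.29,
    162310.21, 162310.21, 162310.21, 162310.21,
]

# Index of the smallest MAE; k is one more than the zero-based index
best_index = min(range(len(mae_by_k)), key=mae_by_k.__getitem__)
best_k = best_index + 1
print(f'Lowest test MAE: ${mae_by_k[best_index]:,.2f} with {best_k} features')
# prints: Lowest test MAE: $160,380.14 with 11 features
```

Choosing `k` by test-set error makes the reported test error optimistic; a validation split or cross-validation on the training data is the usual way to pick it.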


In [29]:
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from IPython.display import display, HTML
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    
    # Fit Ridge Regression model
    display(HTML(f'Ridge Regression, with alpha={alpha}'))
    # Note: normalize=True was removed in scikit-learn 1.2, so this cell needs an older version
    model = Ridge(alpha=alpha, normalize=True)
    model.fit(X_train_enc, y_train)
    y_pred = model.predict(X_test_enc)

    # Get Test MAE
    mae = mean_absolute_error(y_test, y_pred)
    display(HTML(f'Test Mean Absolute Error: ${mae:,.0f}'))
    
    # Plot coefficients
    coefficients = pd.Series(model.coef_, X_train_enc.columns)
    plt.figure(figsize=(16,8))
    coefficients.sort_values().plot.barh(color='grey')
    plt.xlim(-10000,10000)
    plt.show()
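
The coefficient plots show the bars shrinking toward zero as `alpha` grows. The sketch below illustrates the same effect numerically with the closed-form ridge solution, under simplifying assumptions: synthetic data, centered inputs in place of an intercept, and no feature normalization, so the numbers are not comparable to `Ridge(normalize=True)` on the sales data. The helper `ridge_coefficients` is hypothetical.

```python
import numpy as np

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# Center features and target so no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_coefficients(Xc, yc, a)) for a in alphas]

# The length of the coefficient vector decreases as alpha increases
for a, n in zip(alphas, norms):
    print(f'alpha={a:>8}: ||coef|| = {n:.4f}')
```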