<a href="https://colab.research.google.com/github/KathyRoma/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
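
As a rough illustration of how the `RidgeCV` and pipeline stretch goals could fit together, here is a minimal sketch. It runs on synthetic data rather than the NYC sales data, and the alpha grid, `k=5`, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (hypothetical data, not the NYC sales)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Scaling, feature selection, and the model are all fit on the train set only
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# make_pipeline names each step after its lowercased class name
chosen_alpha = pipeline.named_steps['ridgecv'].alpha_
print(f'alpha chosen by RidgeCV: {chosen_alpha}')
```

Calling `pipeline.predict(X_test)` reuses the scaler statistics and the selected columns learned from the train set, so the test set never influences the fitted steps.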

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols (as literal text, not regex), convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
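
One detail of the price cleanup is worth a small check: `$` is an end-of-string anchor in regular expressions, and the default of the `regex` argument to `Series.str.replace` has differed between pandas versions. The sketch below passes `regex=False` explicitly so the symbols are treated as literal text. The sample prices are made up for illustration.

```python
import pandas as pd

# Made-up price strings in a similar style to the raw column
prices = pd.Series(['$550,000', '$1,390,000', '$0'])

cleaned = (
    prices
    .str.replace('$', '', regex=False)  # literal dollar sign, not a regex anchor
    .str.replace(',', '', regex=False)
    .astype(int)
)
print(cleaned.tolist())  # [550000, 1390000, 0]
```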

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [135]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [136]:
df.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [0]:
# Filter our data by building class
df = df[df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS']

In [138]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
7,2,OTHER,01 ONE FAMILY DWELLINGS,1,4090,37,,A1,1193 SACKET AVENUE,,10461.0,1.0,0.0,1.0,3404,1328.0,1925.0,1,A1,0,01/01/2019
8,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,18,,A5,1215 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019
9,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,20,,A5,1211 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019
42,3,OTHER,01 ONE FAMILY DWELLINGS,1,6809,54,,A1,2601 AVENUE R,,11229.0,1.0,0.0,1.0,3333,1262.0,1925.0,1,A1,0,01/01/2019
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019


In [0]:
# Keep sales priced strictly between $100,000 and $2,000,000
df = df[(df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000)]

In [140]:
df.shape

(3151, 21)

In [141]:
# Convert SALE_DATE from string to datetime
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df.dtypes

BOROUGH                                   object
NEIGHBORHOOD                              object
BUILDING_CLASS_CATEGORY                   object
TAX_CLASS_AT_PRESENT                      object
BLOCK                                      int64
LOT                                        int64
EASE-MENT                                float64
BUILDING_CLASS_AT_PRESENT                 object
ADDRESS                                   object
APARTMENT_NUMBER                          object
ZIP_CODE                                 float64
RESIDENTIAL_UNITS                        float64
COMMERCIAL_UNITS                         float64
TOTAL_UNITS                              float64
LAND_SQUARE_FEET                          object
GROSS_SQUARE_FEET                        float64
YEAR_BUILT                               float64
TAX_CLASS_AT_TIME_OF_SALE                  int64
BUILDING_CLASS_AT_TIME_OF_SALE            object
SALE_PRICE                                 int64
SALE_DATE                         datetime64[ns]
dtype: object

In [0]:
df = df.sort_values(by=['SALE_DATE'])

In [143]:
df.head(30)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-01-01
193,5,OTHER,01 ONE FAMILY DWELLINGS,1,1448,1,,A2,479 MAINE AVENUE,,10314.0,1.0,0.0,1.0,3920,1850.0,1974.0,1,A2,670000,2019-01-02
185,5,OTHER,01 ONE FAMILY DWELLINGS,1,5442,145,,A9,257 DOANE AVENUE,,10308.0,1.0,0.0,1.0,2500,1392.0,1977.0,1,A9,505000,2019-01-02
184,5,OTHER,01 ONE FAMILY DWELLINGS,1,5708,35,,A2,17 RATHBUN AVENUE,,10312.0,1.0,0.0,1.0,4000,2278.0,1970.0,1,A2,552000,2019-01-02
180,4,OTHER,01 ONE FAMILY DWELLINGS,1,12985,48,,A1,132-34 BENNETT COURT,,11434.0,1.0,0.0,1.0,3000,900.0,1920.0,1,A1,570000,2019-01-02
178,4,OTHER,01 ONE FAMILY DWELLINGS,1,11936,56,,A5,143-05 110 AVENUE,,11435.0,1.0,0.0,1.0,2435,1426.0,1950.0,1,A5,480000,2019-01-02
176,4,OTHER,01 ONE FAMILY DWELLINGS,1,12352,463,,A1,170-08 116TH AVENUE,,11434.0,1.0,0.0,1.0,2500,1280.0,1925.0,1,A1,520000,2019-01-02
162,4,OTHER,01 ONE FAMILY DWELLINGS,1,12908,45,,A1,13035 230 STREET,,11413.0,1.0,0.0,1.0,3700,1535.0,1945.0,1,A1,540000,2019-01-02
160,4,OTHER,01 ONE FAMILY DWELLINGS,1,3317,24,,A3,80-46 GRENFELL STREET,,11415.0,1.0,0.0,1.0,7000,3203.0,1920.0,1,A3,1390000,2019-01-02


In [0]:
# Splitting data to train and test
train = df[df['SALE_DATE'] <= '2019-03-31']
test = df[df['SALE_DATE'] >= '2019-04-01']

In [145]:
# Let's have a look at our train data
train.describe(include='all')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
count,2507.0,2507,2507,2507.0,2507.0,2507.0,0.0,2507,2507,1,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507.0,2507,2507.0,2507
unique,5.0,7,1,2.0,,,,13,2497,1,,,,,887.0,,,,11,,68
top,4.0,OTHER,01 ONE FAMILY DWELLINGS,1.0,,,,A1,57 CHESTNUT STREET,RP.,,,,,4000.0,,,,A1,,2019-01-31 00:00:00
freq,1204.0,2360,2507,2476.0,,,,919,2,1,,,,,234.0,,,,919,,78
first,,,,,,,,,,,,,,,,,,,,,2019-01-01 00:00:00
last,,,,,,,,,,,,,,,,,,,,,2019-03-30 00:00:00
mean,,,,,6758.303949,75.778221,,,,,10993.398484,0.987635,0.016354,1.003989,,1473.744715,1944.766653,1.0,,621573.7,
std,,,,,3975.909029,157.531138,,,,,494.291462,0.110532,0.129966,0.171794,,599.217635,27.059337,0.0,,291607.2,
min,,,,,21.0,1.0,,,,,10301.0,0.0,0.0,0.0,,0.0,1890.0,1.0,,104000.0,
25%,,,,,3837.5,21.0,,,,,10314.0,1.0,0.0,1.0,,1144.0,1925.0,1.0,,440500.0,


In [146]:
train['TAX_CLASS_AT_PRESENT'].value_counts()

1     2476
1D      31
Name: TAX_CLASS_AT_PRESENT, dtype: int64

In [147]:
train['BOROUGH'].value_counts()

4    1204
5     662
3     398
2     242
1       1
Name: BOROUGH, dtype: int64

In [148]:
train['SALE_PRICE'].mean()

621573.7423214999

In [149]:
# Seems like tax class is valuable for predictions
train.groupby('TAX_CLASS_AT_PRESENT')['SALE_PRICE'].mean().sort_values()

TAX_CLASS_AT_PRESENT
1D    392900.000000
1     624436.781906
Name: SALE_PRICE, dtype: float64

In [150]:
# Drop the target and the columns likely to add noise:
# high-cardinality text, constant or empty columns, and the raw sale date
target = 'SALE_PRICE'
high_cardinality = ['ADDRESS', 'APARTMENT_NUMBER', 'SALE_DATE',
                    'BUILDING_CLASS_AT_PRESENT', 'BUILDING_CLASS_CATEGORY', 'LAND_SQUARE_FEET',
                    'BUILDING_CLASS_AT_TIME_OF_SALE', 'EASE-MENT']
features = train.columns.drop([target] + high_cardinality)
features

Index(['BOROUGH', 'NEIGHBORHOOD', 'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT',
       'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE'],
      dtype='object')

In [0]:
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [152]:
X_train

Unnamed: 0,BOROUGH,NEIGHBORHOOD,TAX_CLASS_AT_PRESENT,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,3,OTHER,1,5495,801,11230.0,1.0,0.0,1.0,1325.0,1930.0,1
61,4,OTHER,1,7918,72,11427.0,1.0,0.0,1.0,2001.0,1940.0,1
193,5,OTHER,1,1448,1,10314.0,1.0,0.0,1.0,1850.0,1974.0,1
185,5,OTHER,1,5442,145,10308.0,1.0,0.0,1.0,1392.0,1977.0,1
184,5,OTHER,1,5708,35,10312.0,1.0,0.0,1.0,2278.0,1970.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
17962,3,OTHER,1,8769,55,11235.0,1.0,0.0,1.0,1460.0,1910.0,1
17961,3,OTHER,1,8769,53,11235.0,1.0,0.0,1.0,1460.0,1910.0,1
17954,3,OTHER,1,7897,5,11234.0,1.0,0.0,1.0,1400.0,1925.0,1
17986,4,OTHER,1,5997,13,11360.0,1.0,0.0,1.0,1906.0,1945.0,1


In [0]:
# One-hot encoding begins
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)

In [154]:
X_train

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_5,BOROUGH_2,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,1,0,0,0,0,1,0,0,0,0,0,0,1,0,5495,801,11230.0,1.0,0.0,1.0,1325.0,1930.0,1
61,0,1,0,0,0,1,0,0,0,0,0,0,1,0,7918,72,11427.0,1.0,0.0,1.0,2001.0,1940.0,1
193,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1448,1,10314.0,1.0,0.0,1.0,1850.0,1974.0,1
185,0,0,1,0,0,1,0,0,0,0,0,0,1,0,5442,145,10308.0,1.0,0.0,1.0,1392.0,1977.0,1
184,0,0,1,0,0,1,0,0,0,0,0,0,1,0,5708,35,10312.0,1.0,0.0,1.0,2278.0,1970.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17962,1,0,0,0,0,1,0,0,0,0,0,0,1,0,8769,55,11235.0,1.0,0.0,1.0,1460.0,1910.0,1
17961,1,0,0,0,0,1,0,0,0,0,0,0,1,0,8769,53,11235.0,1.0,0.0,1.0,1460.0,1910.0,1
17954,1,0,0,0,0,1,0,0,0,0,0,0,1,0,7897,5,11234.0,1.0,0.0,1.0,1400.0,1925.0,1
17986,0,1,0,0,0,1,0,0,0,0,0,0,1,0,5997,13,11360.0,1.0,0.0,1.0,1906.0,1945.0,1


In [0]:
X_test = encoder.transform(X_test)

In [156]:
X_test

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_5,BOROUGH_2,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
18398,0,1,0,0,0,1,0,0,0,0,0,0,1,0,11255,35,11411.0,1.0,0.0,1.0,1031.0,1950.0,1
18410,0,1,0,0,0,1,0,0,0,0,0,0,1,0,12175,37,11433.0,1.0,0.0,1.0,1386.0,1925.0,1
18401,0,1,0,0,0,1,0,0,0,0,0,0,1,0,9202,73,11418.0,1.0,0.0,1.0,1714.0,1915.0,1
18402,0,1,0,0,0,1,0,0,0,0,0,0,1,0,9560,16,11419.0,1.0,0.0,1.0,976.0,1925.0,1
18433,0,0,1,0,0,1,0,0,0,0,0,0,1,0,7066,42,10309.0,1.0,0.0,1.0,930.0,1992.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22944,1,0,0,0,0,1,0,0,0,0,0,0,1,0,3573,33,11212.0,1.0,0.0,1.0,1128.0,1987.0,1
22929,0,0,0,1,0,1,0,0,0,0,0,0,1,0,5052,68,10466.0,1.0,0.0,1.0,1184.0,1925.0,1
23033,0,1,0,0,0,1,0,0,0,0,0,0,1,0,12295,23,11434.0,1.0,0.0,1.0,1020.0,1935.0,1
22999,0,1,0,0,0,1,0,0,0,0,0,0,1,0,8756,48,11004.0,1.0,0.0,1.0,1682.0,1950.0,1


In [157]:
# Perform feature selection
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=15)


X_train_selected = selector.fit_transform(X_train, y_train)

  corr /= X_norms
  F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
  return (a < x) & (x < b)
  return (a < x) & (x < b)
  cond2 = cond0 & (x <= _a)
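
The fragments above look like source lines echoed by runtime warnings raised while `f_regression` computes its scores. One plausible cause (an assumption, since the warning text itself is not shown) is a zero-variance column: the `describe` output reports a standard deviation of 0.0 for `TAX_CLASS_AT_TIME_OF_SALE` in the train set. A minimal sketch of dropping constant columns with `VarianceThreshold` before selecting features, using toy data with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

# Toy frame with one constant column (hypothetical data)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'GROSS_SQUARE_FEET': rng.uniform(800, 3000, size=100),
    'YEAR_BUILT': rng.integers(1900, 2000, size=100).astype(float),
    'TAX_CLASS_AT_TIME_OF_SALE': np.ones(100),  # constant, as in the train set
})
y = 300 * X['GROSS_SQUARE_FEET'] + rng.normal(scale=50_000, size=100)

# threshold=0.0 removes only the features whose variance is exactly zero
variance_filter = VarianceThreshold(threshold=0.0)
X_varying = variance_filter.fit_transform(X)
kept_columns = X.columns[variance_filter.get_support()]

# Score the remaining columns; no column has zero variance at this point
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X_varying, y)
best_feature = kept_columns[selector.get_support()][0]
print(list(kept_columns), best_feature)
```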


In [158]:
X_train_selected.shape

(2507, 15)

In [159]:
# Let's have a look at what has been selected
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print('\n')
print('Features not selected:')
for name in unselected_names:
    print(name)

Features selected:
BOROUGH_3
BOROUGH_4
BOROUGH_5
BOROUGH_2
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_BOROUGH PARK
TAX_CLASS_AT_PRESENT_1
BLOCK
ZIP_CODE
RESIDENTIAL_UNITS
COMMERCIAL_UNITS
TOTAL_UNITS
GROSS_SQUARE_FEET


Features not selected:
BOROUGH_1
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_ASTORIA
TAX_CLASS_AT_PRESENT_1D
LOT
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE


In [160]:
# We probably don't need all 15 features: try every k and compare test MAE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns) + 1):
    print(f'{k} features')

    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')

# Seems like 8 is enough: test MAE is about $172,000 at k=8
# and levels off near $160,000 from k=12 onward

1 features
Test MAE: $183,641 

2 features
Test MAE: $184,337 

3 features
Test MAE: $183,041 

4 features
Test MAE: $177,282 

5 features
Test MAE: $177,884 

6 features
Test MAE: $177,482 

7 features
Test MAE: $172,593 

8 features
Test MAE: $171,870 

9 features
Test MAE: $171,228 

10 features
Test MAE: $163,252 

11 features
Test MAE: $164,473 

12 features
Test MAE: $160,247 

13 features
Test MAE: $160,334 

14 features
Test MAE: $160,334 

15 features
Test MAE: $160,123 

16 features
Test MAE: $160,032 

17 features
Test MAE: $160,552 

18 features
Test MAE: $160,554 

19 features
Test MAE: $160,585 

20 features
Test MAE: $160,587 

21 features
Test MAE: $160,587 

22 features
Test MAE: $160,587 

23 features
Test MAE: $160,587 
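
The loop above compares values of `k` by their test-set MAE, which means the test set takes part in model selection. A common alternative is to choose `k` with cross-validation on the train set and keep the test set for one final check. The sketch below shows that idea on synthetic data. The pipeline step names, the 5-fold setting, and the data are assumptions for illustration, and ordinary k-fold splitting ignores the time ordering of real sales.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic train set: only the first two of eight features carry signal
rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 8))
y_train = 5.0 * X_train[:, 0] + 3.0 * X_train[:, 1] + rng.normal(size=300)

pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_regression)),
    ('model', LinearRegression()),
])

# Try every k; scikit-learn maximizes scores, so MAE is negated
search = GridSearchCV(
    pipeline,
    param_grid={'select__k': list(range(1, 9))},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_['select__k']
cv_mae = -search.best_score_
print(f'best k: {best_k}, cross-validated MAE: {cv_mae:.2f}')
```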





In [161]:
# Looking for our lucky 8
selector = SelectKBest(score_func=f_regression, k=8)
X_train_selected = selector.fit_transform(X_train, y_train)



In [162]:
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print('\n')
print('Features not selected:')
for name in unselected_names:
    print(name)

Features selected:
BOROUGH_3
BOROUGH_5
BOROUGH_2
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
ZIP_CODE
GROSS_SQUARE_FEET


Features not selected:
BOROUGH_4
BOROUGH_1
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_BOROUGH PARK
NEIGHBORHOOD_ASTORIA
TAX_CLASS_AT_PRESENT_1
TAX_CLASS_AT_PRESENT_1D
BLOCK
LOT
RESIDENTIAL_UNITS
COMMERCIAL_UNITS
TOTAL_UNITS
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE


In [0]:
# Multiple Ridge Regression model
from IPython.display import display, HTML
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

In [164]:
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    
    # Fit Ridge Regression model
    display(HTML(f'Ridge Regression, with alpha={alpha}'))
    model = Ridge(alpha=alpha, normalize=True)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Get Test MAE
    mae = mean_absolute_error(y_test, y_pred)
    display(HTML(f'Test Mean Absolute Error: ${mae:,.0f}'))
    
    # Plot coefficients
    coefficients = pd.Series(model.coef_, X_train.columns)
    plt.figure(figsize=(16,8))
    coefficients.sort_values().plot.barh(color='grey')
    plt.xlim(-400,700)
    plt.show()
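
The `normalize=True` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell will not run on recent versions. The assignment's other suggestion, scaling beforehand, can be sketched as follows. The data is synthetic and the variable names are assumptions. Note that `StandardScaler` divides by the standard deviation, whereas `normalize=True` divided by the L2 norm, so the same `alpha` does not give the same amount of regularization in the two setups.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and sale prices (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(loc=1000.0, scale=300.0, size=(120, 3))
y = 200.0 * X[:, 0] + rng.normal(scale=20_000.0, size=120)
X_train, X_test = X[:90], X[90:]
y_train, y_test = y[:90], y[90:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: ${mae:,.0f}')
```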

  from ipykernel import kernelapp as app


