<a href="https://colab.research.google.com/github/shengjiyang/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
# Note that here we have used list comprehension for this.
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [5]:
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [7]:
df.SALE_PRICE.value_counts()

0          6909
10          199
800000      125
750000      121
650000      120
           ... 
5236177       1
229000        1
397218        1
4112000       1
1751425       1
Name: SALE_PRICE, Length: 3831, dtype: int64

In [8]:
df.shape

(23040, 21)

In [57]:
# Creating the subset we will use below

df = df[df.BUILDING_CLASS_CATEGORY == "01 ONE FAMILY DWELLINGS"]
df = df[df.SALE_PRICE > 100000]
print(df.shape)

df = df[df.SALE_PRICE < 2000000]
df.shape

(3232, 21)


(3151, 21)

In [25]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019


In [26]:
df.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [58]:
df.SALE_DATE = pd.to_datetime(df.SALE_DATE)
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-01-01
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,2019-01-02
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,2019-01-02
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,2019-01-02


In [66]:
# EASE-MENT and APARTMENT_NUMBER are both are almost nothing but NaN values,
# so I have removed them so they don't cause problems below when it comes to
# calculating mean standard error.

df.isnull().sum()

print(df["EASE-MENT"].head())
print(df.APARTMENT_NUMBER.head())

44    NaN
61    NaN
78    NaN
108   NaN
111   NaN
Name: EASE-MENT, dtype: float64
44     NaN
61     NaN
78     NaN
108    NaN
111    NaN
Name: APARTMENT_NUMBER, dtype: object


In [69]:
df = df.drop(columns = ["EASE-MENT", "APARTMENT_NUMBER"], axis = 1)
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'BUILDING_CLASS_AT_PRESENT',
       'ADDRESS', 'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
       'TOTAL_UNITS', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [108]:
# Courtesy of Connor Clark

df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df = df.set_index(df['SALE_DATE'])
df = df.drop(columns = ["SALE_DATE"])
df = df.sort_index()
df

Unnamed: 0_level_0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2019-01-01,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000
2019-01-01,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,80-23 232ND STREET,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000
2019-01-02,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,A1,1260 RHINELANDER AVE,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000
2019-01-02,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,A1,469 E 25TH ST,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000
2019-01-02,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,A5,5521 WHITTY LANE,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,A2,244-15 135 AVENUE,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,A1,10919 132ND STREET,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,A0,135-24 122ND STREET,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,A1,134-34 157TH STREET,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000


In [139]:
# Sorting Categorical Values by Cardinality in Ascending Order 
df.select_dtypes(exclude = 'number').describe().T.sort_values(by = 'unique')

Unnamed: 0,count,unique,top,freq
BUILDING_CLASS_CATEGORY,3151,1,01 ONE FAMILY DWELLINGS,3151
TAX_CLASS_AT_PRESENT,3151,2,1,3111
BOROUGH,3151,5,4,1580
NEIGHBORHOOD,3151,7,OTHER,2959
BUILDING_CLASS_AT_TIME_OF_SALE,3151,11,A1,1186
BUILDING_CLASS_AT_PRESENT,3151,13,A1,1185
LAND_SQUARE_FEET,3151,1035,4000,289
ADDRESS,3151,3135,94 CELESTE COURT,2


In [113]:
# SALE_PRICE seems to be a rational choice of target

# The cardinality of LATE_SQUARE_FEET, and ADRESS is way
# too high to perform One-hot encoding, so we'll drop them.

target = "SALE_PRICE"
high_cardinality = ["LAND_SQUARE_FEET", "ADDRESS"]

features = df.columns.drop([target] + high_cardinality)

df_X = df[features]
df_y = df[target]

print(df_X.shape)
print(df_y.shape)
df_X

(3151, 15)
(3151,)


Unnamed: 0_level_0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2019-01-01,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,11230.0,1.0,0.0,1.0,1325.0,1930.0,1,A9
2019-01-01,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,11427.0,1.0,0.0,1.0,2001.0,1940.0,1,A1
2019-01-02,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,A1,10461.0,1.0,0.0,1.0,2043.0,1925.0,1,A1
2019-01-02,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,A1,11226.0,1.0,0.0,1.0,2680.0,1899.0,1,A1
2019-01-02,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,A5,11203.0,1.0,0.0,1.0,1872.0,1940.0,1,A5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,A2,11422.0,1.0,0.0,1.0,1478.0,1925.0,1,A2
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,A1,11420.0,1.0,0.0,1.0,1280.0,1930.0,1,A1
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,A0,11420.0,1.0,0.0,1.0,1333.0,1945.0,1,A0
2019-04-30,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,A1,11434.0,1.0,0.0,1.0,1020.0,1935.0,1,A1


In [114]:
# Performing One-hot encoding on the DataFrame
import category_encoders as ce

encoder = ce.OneHotEncoder(use_cat_names = True)

df_X = encoder.fit_transform(df_X)

print(df_X.shape)
df_X

(3151, 48)


Unnamed: 0_level_0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1
2019-01-01,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,5495,801,1,0,0,0,0,0,0,0,0,0,0,0,0,11230.0,1.0,0.0,1.0,1325.0,1930.0,1,1,0,0,0,0,0,0,0,0,0,0
2019-01-01,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,7918,72,0,1,0,0,0,0,0,0,0,0,0,0,0,11427.0,1.0,0.0,1.0,2001.0,1940.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,4210,19,0,1,0,0,0,0,0,0,0,0,0,0,0,10461.0,1.0,0.0,1.0,2043.0,1925.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,5212,69,0,1,0,0,0,0,0,0,0,0,0,0,0,11226.0,1.0,0.0,1.0,2680.0,1899.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,7930,121,0,0,1,0,0,0,0,0,0,0,0,0,0,11203.0,1.0,0.0,1.0,1872.0,1940.0,1,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,13215,3,0,0,0,0,1,0,0,0,0,0,0,0,0,11422.0,1.0,0.0,1.0,1478.0,1925.0,1,0,0,0,0,1,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,11612,73,0,1,0,0,0,0,0,0,0,0,0,0,0,11420.0,1.0,0.0,1.0,1280.0,1930.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,11808,50,0,0,0,1,0,0,0,0,0,0,0,0,0,11420.0,1.0,0.0,1.0,1333.0,1945.0,1,0,0,0,1,0,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,12295,23,0,1,0,0,0,0,0,0,0,0,0,0,0,11434.0,1.0,0.0,1.0,1020.0,1935.0,1,0,1,0,0,0,0,0,0,0,0,0


In [120]:
# Courtesy of Tyler Etheridge

# Create date for split condition

import datetime
split_date = datetime.datetime(2019, 4, 1)

# January though May = train, April = test
X_train = df_X[df_X.index < split_date]
X_test = df_X[df_X.index >= split_date]

y_train = df_y[df_y.index < split_date]
y_test = df_y[df_y.index >= split_date]

X_train

Unnamed: 0_level_0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1
2019-01-01,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,5495,801,1,0,0,0,0,0,0,0,0,0,0,0,0,11230.0,1.0,0.0,1.0,1325.0,1930.0,1,1,0,0,0,0,0,0,0,0,0,0
2019-01-01,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,7918,72,0,1,0,0,0,0,0,0,0,0,0,0,0,11427.0,1.0,0.0,1.0,2001.0,1940.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,4210,19,0,1,0,0,0,0,0,0,0,0,0,0,0,10461.0,1.0,0.0,1.0,2043.0,1925.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,5212,69,0,1,0,0,0,0,0,0,0,0,0,0,0,11226.0,1.0,0.0,1.0,2680.0,1899.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-01-02,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,7930,121,0,0,1,0,0,0,0,0,0,0,0,0,0,11203.0,1.0,0.0,1.0,1872.0,1940.0,1,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-03-29,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,4081,44,0,0,0,0,1,0,0,0,0,0,0,0,0,10306.0,1.0,0.0,1.0,921.0,1950.0,1,0,0,0,0,1,0,0,0,0,0,0
2019-03-29,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,2373,201,0,0,1,0,0,0,0,0,0,0,0,0,0,10314.0,1.0,0.0,1.0,2128.0,1980.0,1,0,0,1,0,0,0,0,0,0,0,0
2019-03-29,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,1132,42,0,1,0,0,0,0,0,0,0,0,0,0,0,10302.0,1.0,0.0,1.0,1807.0,2018.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-03-29,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,3395,37,0,0,0,0,1,0,0,0,0,0,0,0,0,10305.0,1.0,0.0,1.0,621.0,1930.0,1,0,0,0,0,1,0,0,0,0,0,0


In [121]:
X_test

Unnamed: 0_level_0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1
2019-04-01,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,5913,878,0,1,0,0,0,0,0,0,0,0,0,0,0,10471.0,1.0,0.0,1.0,2272.0,1930.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-04-01,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,5488,48,0,0,0,0,1,0,0,0,0,0,0,0,0,10465.0,1.0,0.0,1.0,720.0,1935.0,1,0,0,0,0,1,0,0,0,0,0,0
2019-04-01,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,5936,31,0,1,0,0,0,0,0,0,0,0,0,0,0,11209.0,1.0,0.0,1.0,2210.0,1925.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-04-01,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,7813,24,0,0,1,0,0,0,0,0,0,0,0,0,0,11210.0,1.0,0.0,1.0,1520.0,1915.0,1,0,0,1,0,0,0,0,0,0,0,0
2019-04-01,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,8831,160,1,0,0,0,0,0,0,0,0,0,0,0,0,11229.0,1.0,0.0,1.0,840.0,1925.0,1,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,13215,3,0,0,0,0,1,0,0,0,0,0,0,0,0,11422.0,1.0,0.0,1.0,1478.0,1925.0,1,0,0,0,0,1,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,11612,73,0,1,0,0,0,0,0,0,0,0,0,0,0,11420.0,1.0,0.0,1.0,1280.0,1930.0,1,0,1,0,0,0,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,11808,50,0,0,0,1,0,0,0,0,0,0,0,0,0,11420.0,1.0,0.0,1.0,1333.0,1945.0,1,0,0,0,1,0,0,0,0,0,0,0
2019-04-30,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,12295,23,0,1,0,0,0,0,0,0,0,0,0,0,0,11434.0,1.0,0.0,1.0,1020.0,1935.0,1,0,1,0,0,0,0,0,0,0,0,0


In [123]:
y_train

SALE_DATE
2019-01-01    550000
2019-01-01    200000
2019-01-02    810000
2019-01-02    125000
2019-01-02    620000
               ...  
2019-03-29    330000
2019-03-29    690000
2019-03-29    610949
2019-03-29    520000
2019-03-30    104000
Name: SALE_PRICE, Length: 2507, dtype: int64

In [124]:
y_test

SALE_DATE
2019-04-01     895000
2019-04-01     253500
2019-04-01    1300000
2019-04-01     789000
2019-04-01     525000
               ...   
2019-04-30     635000
2019-04-30     514000
2019-04-30     635000
2019-04-30     545000
2019-04-30     510000
Name: SALE_PRICE, Length: 644, dtype: int64

In [0]:
# I have chosen to use SelectKBest in order to determine the number
# features which will result in the lowest test error.

# Note that I have also chosen to use Ridge Regression rather than
# standard Linear Regression Below.

from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

ks = []
test_errors = []

for k in range(1, len(X_train.columns) + 1):
  ks.append(k)

  selector = SelectKBest(score_func = f_regression, k = k)
  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.fit_transform(X_test, y_test)

  model = Ridge(normalize = False)
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)

  mae = mean_absolute_error(y_test, y_pred)
  test_errors.append(mae)

In [199]:
# Using this method, we find that k = 21 is
# the ideal value for minimizing test error.

d = {"k" : ks, "Test Error" : test_errors}

SelectK = pd.DataFrame(d, columns = ["k", "Test Error"])
SelectK.loc[SelectK["Test Error"] == SelectK["Test Error"].max()]

Unnamed: 0,k,Test Error
20,21,3806523000.0


In [0]:
# Repeating the above Steps with Linear Regression instead,
# in order to fulfill the stretch goal requirement

from sklearn.linear_model import LinearRegression

ks = []
test_errors = []

for k in range(1, len(X_train.columns) + 1):
  ks.append(k)

  selector = SelectKBest(score_func = f_regression, k = k)
  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.fit_transform(X_test, y_test)

  model = LinearRegression()
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)

  mae = mean_absolute_error(y_test, y_pred)
  test_errors.append(mae)

In [208]:
l = {"k" : ks, "Test Error" : test_errors}

LinearK = pd.DataFrame(d, columns = ["k", "Test Error"])
print("The errors did indeel blow up!")

LinearK.tail()

The errors did indeel blow up!


Unnamed: 0,k,Test Error
43,44,362836500.0
44,45,2814508000.0
45,46,2807836000.0
46,47,157895.0
47,48,157895.0


In [0]:
# Now to actually perform the feature selection...

selector_21 = SelectKBest(score_func = f_regression, k = 21)

X_train_selected = selector_21.fit_transform(X_train, y_train)
X_test_selected = selector_21.fit_transform(X_test, y_test)

In [142]:
X_train_selected.shape, X_test_selected.shape

((2507, 21), (644, 21))

In [144]:
# Let's see what features were selected and which were not...
# NOTE TO SELF: 要多研究selector.get_support()來了解以下的電腦怎麼用以下的code

all_names = X_train.columns
selected_mask = selector_21.get_support()

selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print("Features selected:\n")
for name in selected_names:
  print(name)

print("\n")

print('Features not selected:\n')
for name in unselected_names:
  print(name)

Features selected:

BOROUGH_3
BOROUGH_2
BOROUGH_5
BOROUGH_1
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_BOROUGH PARK
TAX_CLASS_AT_PRESENT_1
TAX_CLASS_AT_PRESENT_1D
BLOCK
LOT
BUILDING_CLASS_AT_PRESENT_A3
BUILDING_CLASS_AT_PRESENT_A8
ZIP_CODE
RESIDENTIAL_UNITS
TOTAL_UNITS
GROSS_SQUARE_FEET
BUILDING_CLASS_AT_TIME_OF_SALE_A3
BUILDING_CLASS_AT_TIME_OF_SALE_A8


Features not selected:

BOROUGH_4
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_ASTORIA
BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS
BUILDING_CLASS_AT_PRESENT_A9
BUILDING_CLASS_AT_PRESENT_A1
BUILDING_CLASS_AT_PRESENT_A5
BUILDING_CLASS_AT_PRESENT_A0
BUILDING_CLASS_AT_PRESENT_A2
BUILDING_CLASS_AT_PRESENT_S1
BUILDING_CLASS_AT_PRESENT_A4
BUILDING_CLASS_AT_PRESENT_A6
BUILDING_CLASS_AT_PRESENT_B2
BUILDING_CLASS_AT_PRESENT_S0
BUILDING_CLASS_AT_PRESENT_B3
COMMERCIAL_UNITS
YEAR_BUILT
TAX_CLASS_AT_TIME_OF_SALE
BUILDING_CLASS_AT_TIME_OF_SALE_A9
BUILDING_CLASS_AT_TIME_OF_SALE

In [145]:
# Although, technically, I have already fit a Ridge Regression Model above,
# I will numeric data from the DataFrame build a number of Ridge Models with
# scaled alpha/lambda values.

df.select_dtypes(include = 'number').describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BLOCK,3151.0,6917.976515,3963.326705,21.0,4016.0,6301.0,10208.5,16350.0
LOT,3151.0,75.981593,161.089514,1.0,21.0,42.0,69.0,2720.0
ZIP_CODE,3151.0,11027.219613,482.875284,10030.0,10461.0,11235.0,11413.0,11697.0
RESIDENTIAL_UNITS,3151.0,0.987623,0.113414,0.0,1.0,1.0,1.0,2.0
COMMERCIAL_UNITS,3151.0,0.015868,0.127499,0.0,0.0,0.0,0.0,2.0
TOTAL_UNITS,3151.0,1.003491,0.171789,0.0,1.0,1.0,1.0,3.0
GROSS_SQUARE_FEET,3151.0,1470.306887,586.3392,0.0,1144.0,1360.0,1683.0,7875.0
YEAR_BUILT,3151.0,1943.6947,26.676786,1890.0,1925.0,1938.0,1955.0,2018.0
TAX_CLASS_AT_TIME_OF_SALE,3151.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SALE_PRICE,3151.0,628560.112663,292990.378313,104000.0,447500.0,568000.0,760000.0,1955000.0


In [146]:
df.select_dtypes(include = 'number')

Unnamed: 0_level_0,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2019-01-01,5495,801,11230.0,1.0,0.0,1.0,1325.0,1930.0,1,550000
2019-01-01,7918,72,11427.0,1.0,0.0,1.0,2001.0,1940.0,1,200000
2019-01-02,4210,19,10461.0,1.0,0.0,1.0,2043.0,1925.0,1,810000
2019-01-02,5212,69,11226.0,1.0,0.0,1.0,2680.0,1899.0,1,125000
2019-01-02,7930,121,11203.0,1.0,0.0,1.0,1872.0,1940.0,1,620000
...,...,...,...,...,...,...,...,...,...,...
2019-04-30,13215,3,11422.0,1.0,0.0,1.0,1478.0,1925.0,1,635000
2019-04-30,11612,73,11420.0,1.0,0.0,1.0,1280.0,1930.0,1,514000
2019-04-30,11808,50,11420.0,1.0,0.0,1.0,1333.0,1945.0,1,635000
2019-04-30,12295,23,11434.0,1.0,0.0,1.0,1020.0,1935.0,1,545000


In [155]:
# In order do something different for a change,
# let's try and predict YEAR_BUILT using all other
# numeric features as independent variables
# Using Ridge Regression...

n_target = "YEAR_BUILT"

n_df = df.select_dtypes(include = 'number').drop(n_target, axis = 1)
n_features = n_df.columns

n_df_X = df[n_features]
n_df_y = df[n_target]

n_df_X

Unnamed: 0_level_0,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2019-01-01,5495,801,11230.0,1.0,0.0,1.0,1325.0,1,550000
2019-01-01,7918,72,11427.0,1.0,0.0,1.0,2001.0,1,200000
2019-01-02,4210,19,10461.0,1.0,0.0,1.0,2043.0,1,810000
2019-01-02,5212,69,11226.0,1.0,0.0,1.0,2680.0,1,125000
2019-01-02,7930,121,11203.0,1.0,0.0,1.0,1872.0,1,620000
...,...,...,...,...,...,...,...,...,...
2019-04-30,13215,3,11422.0,1.0,0.0,1.0,1478.0,1,635000
2019-04-30,11612,73,11420.0,1.0,0.0,1.0,1280.0,1,514000
2019-04-30,11808,50,11420.0,1.0,0.0,1.0,1333.0,1,635000
2019-04-30,12295,23,11434.0,1.0,0.0,1.0,1020.0,1,545000


In [0]:
print("Earliest: ", df.YEAR_BUILT

In [0]:
# INSTRUCTIONS:Fit a ridge regression model with multiple features.
# Use the normalize=True parameter (or do feature scaling beforehand — use the
# scaler's fit_transform method with the train set, and the scaler's transform
# method with the test set)

# SIDE NOTE: If you knew you were going to do Ridge Regression from the
# beginning, who would ever choose to do the latter?

In [156]:
# Courtesy of Tyler Etheridge

# Create date for split condition

import datetime
split_date = datetime.datetime(2019, 4, 1)

# January though May = train, April = test
n_X_train = n_df_X[df_X.index < split_date]
n_X_test = n_df_X[df_X.index >= split_date]

n_y_train = n_df_y[df_y.index < split_date]
n_y_test = n_df_y[df_y.index >= split_date]

n_X_train

Unnamed: 0_level_0,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
SALE_DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2019-01-01,5495,801,11230.0,1.0,0.0,1.0,1325.0,1,550000
2019-01-01,7918,72,11427.0,1.0,0.0,1.0,2001.0,1,200000
2019-01-02,4210,19,10461.0,1.0,0.0,1.0,2043.0,1,810000
2019-01-02,5212,69,11226.0,1.0,0.0,1.0,2680.0,1,125000
2019-01-02,7930,121,11203.0,1.0,0.0,1.0,1872.0,1,620000
...,...,...,...,...,...,...,...,...,...
2019-03-29,4081,44,10306.0,1.0,0.0,1.0,921.0,1,330000
2019-03-29,2373,201,10314.0,1.0,0.0,1.0,2128.0,1,690000
2019-03-29,1132,42,10302.0,1.0,0.0,1.0,1807.0,1,610949
2019-03-29,3395,37,10305.0,1.0,0.0,1.0,621.0,1,520000


In [157]:
n_y_train

SALE_DATE
2019-01-01    1930.0
2019-01-01    1940.0
2019-01-02    1925.0
2019-01-02    1899.0
2019-01-02    1940.0
               ...  
2019-03-29    1950.0
2019-03-29    1980.0
2019-03-29    2018.0
2019-03-29    1930.0
2019-03-30    1950.0
Name: YEAR_BUILT, Length: 2507, dtype: float64

In [186]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

deg = 2
alpha = [0.001, 0.01, 0.02, 0.1, 1.0, 10.0, 100.0, 1000.0]

print("Degree:     ", deg)

for alpha in alpha:
  n_model = make_pipeline(PolynomialFeatures(degree = deg), Ridge(alpha = alpha))
  n_model.fit(n_X_train, n_y_train)

  y_pred = n_model.predict(n_X_test)
  mae = mean_absolute_error(n_y_test, y_pred)
  print("alpha:      ", alpha)
  print("Test Error: ", mae, "\n")

Degree:      2
alpha:       0.001
Test Error:  194.39946478390723 

alpha:       0.01
Test Error:  34.23173465015319 

alpha:       0.02
Test Error:  16.449852015225265 

alpha:       0.1
Test Error:  16.50481155625491 

alpha:       1.0
Test Error:  16.59371792289542 

alpha:       10.0
Test Error:  16.6202565318353 

alpha:       100.0
Test Error:  16.627154896027932 

alpha:       1000.0
Test Error:  16.630194581006954 



  overwrite_a=True).T
  overwrite_a=True).T
  overwrite_a=True).T
  overwrite_a=True).T
  overwrite_a=True).T
  overwrite_a=True).T


In [187]:
print("From the data above, we can see that out of the models we have created")
print("using Ridge Regression, a second-degree polynomial with an alpha value")
print("of 0.02 is able to predict the year a given piece of real estate")
print("accurately within 16.5 years.")

From the data above, we can see that out of the models we have created
using Ridge Regression, a second-degree polynomial with an alpha value
of 0.02 is able to predict the year a given piece of real estate
accurately within 16.5 years.


In [195]:
# Now to repeat the process, but this time,
# we will pass normalize = True into the model

deg = 2
alpha = [0.001, 0.002, 0.01, 0.02, 0.1, 1.0, 10.0, 100.0, 1000.0]

print("Degree:     ", deg)

for alpha in alpha:
  n_model = make_pipeline(PolynomialFeatures(degree = deg),
                          Ridge(alpha = alpha, normalize = True))
  
  n_model.fit(n_X_train, n_y_train)

  y_pred = n_model.predict(n_X_test)
  mae = mean_absolute_error(n_y_test, y_pred)
  print("alpha:      ", alpha)
  print("Test Error: ", mae, "\n")

Degree:      2
alpha:       0.001
Test Error:  17.24714981397474 

alpha:       0.002
Test Error:  17.244710960090117 

alpha:       0.01
Test Error:  17.247316856744785 

alpha:       0.02
Test Error:  17.257196215183974 

alpha:       0.1
Test Error:  17.453976926535656 

alpha:       1.0
Test Error:  17.96146894161844 

alpha:       10.0
Test Error:  19.210507406604293 

alpha:       100.0
Test Error:  20.133862056325626 

alpha:       1000.0
Test Error:  20.293209686064245 



In [201]:
print("Among the normalized models, the most accurate is the one we have")
print("discovered is a second degree polynomial with an alpha value of 0.002.")

print("\nUnfortunately, we may have deceived ourselves slightly with our")
print("original model. The normalized model is only able to predict the year")
print("a piece of real estate was built within 17.2 years.")

Among the normalized models, the most accurate is the one we have
discovered is a second degree polynomial with an alpha value of 0.002.

Unfortunately, we may have deceived ourselves slightly with our
original model. The normalized model is only able to predict the year
a piece of real estate was built within 17.2 years.


In [210]:
print(n_X_train.shape)
print(n_y_train.shape)

(2507, 9)
(2507,)


In [234]:
# Playing with RidgeCV
alpha = (0.001, 0.002, 0.01, 0.02, 0.1, 1.0, 10.0, 100.0, 1000.0)

from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas = alpha, normalize = True)
ridge.fit(n_X_train, n_y_train)

print("alpha:      ", ridge.alpha_)
print("Test Error: ", mean_absolute_error(n_y_test, ridge.predict(n_X_test)))

print("\nCompare with our above findings for alpha = 0.002 in the normalized models.")

alpha:       0.002
Test Error:  17.83711535737854

Compare with our above findings for alpha = 0.002 in the normalized models.
