<a href="https://colab.research.google.com/github/ilEnzio/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/ERLE_GRANGERII_DS18__LS_DS_213_assignment_Ridge_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
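
The feature-scaling route mentioned in the ridge item above can be sketched roughly as follows. This is a minimal illustration on synthetic data (the arrays and coefficients are made up, not the NYC dataset): the scaler learns its statistics from the training rows only, and the test rows are transformed with those same statistics. Recent scikit-learn releases no longer accept `normalize=True` on `Ridge`, so scaling beforehand is the more portable choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(1500, 500, 200), rng.normal(0, 1, 200)])
y = 300 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, 200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f"Test MAE: {test_mae:,.0f}")
```

Because the penalty in ridge regression depends on the size of the coefficients, unscaled features with large units would otherwise be penalized differently from features with small units.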

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [594]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [595]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)  # '$' is a regex anchor, so match it literally
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [596]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [597]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [598]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [599]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23040 entries, 0 to 23039
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   BOROUGH                         23040 non-null  object 
 1   NEIGHBORHOOD                    23040 non-null  object 
 2   BUILDING_CLASS_CATEGORY         23040 non-null  object 
 3   TAX_CLASS_AT_PRESENT            23039 non-null  object 
 4   BLOCK                           23040 non-null  int64  
 5   LOT                             23040 non-null  int64  
 6   EASE-MENT                       0 non-null      float64
 7   BUILDING_CLASS_AT_PRESENT       23039 non-null  object 
 8   ADDRESS                         23040 non-null  object 
 9   APARTMENT_NUMBER                5201 non-null   object 
 10  ZIP_CODE                        23039 non-null  float64
 11  RESIDENTIAL_UNITS               23039 non-null  float64
 12  COMMERCIAL_UNITS                

In [600]:
df["SALE_PRICE"].value_counts()

0          6909
10          199
800000      125
750000      121
650000      120
           ... 
5236177       1
229000        1
397218        1
4112000       1
1751425       1
Name: SALE_PRICE, Length: 3831, dtype: int64

In [601]:
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [602]:
df["BUILDING_CLASS_CATEGORY"].value_counts()

01 ONE FAMILY DWELLINGS                       5061
02 TWO FAMILY DWELLINGS                       4567
10 COOPS - ELEVATOR APARTMENTS                3471
13 CONDOS - ELEVATOR APARTMENTS               3339
03 THREE FAMILY DWELLINGS                     1438
07 RENTALS - WALKUP APARTMENTS                 807
09 COOPS - WALKUP APARTMENTS                   672
15 CONDOS - 2-10 UNIT RESIDENTIAL              421
04 TAX CLASS 1 CONDOS                          418
44 CONDO PARKING                               366
17 CONDO COOPS                                 300
05 TAX CLASS 1 VACANT LAND                     288
22 STORE BUILDINGS                             288
12 CONDOS - WALKUP APARTMENTS                  256
14 RENTALS - 4-10 UNIT                         200
29 COMMERCIAL GARAGES                          147
08 RENTALS - ELEVATOR APARTMENTS               120
30 WAREHOUSES                                  105
21 OFFICE BUILDINGS                             96
31 COMMERCIAL VACANT LAND      

In [603]:
# just make a copy so as to preserve the original dataset from this point. 

nyc_realestate_df = df.copy()
nyc_realestate_df.shape

(23040, 21)

## Import Libraries and Modules

In [604]:
import matplotlib.pyplot as plt
import numpy as np


# special sauce for custom method :) 
from collections import namedtuple

# ML models
from sklearn.linear_model import LinearRegression # old reliable model
from sklearn.linear_model import Ridge # RidgeRegression to guard against overfit

# Metric accessing functions
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Transformers
from category_encoders import OneHotEncoder # splits cat variable 
from sklearn.feature_selection import SelectKBest # selects highest correlating features


# List of things to do:

### * drop EASE-MENT
### * Change ZIP_CODE to categorical
### * Convert LAND_SQUARE_FEET to float or int
### * Change SALE_DATE to datetime

### * subset/filter on "BUILDING_CLASS_CATEGORY" for 01 ONE FAMILY DWELLINGS
### * subset that^ for 100,000 < sale price < 2,000,000


In [605]:
# Drop EASE-MENT: the column has no non-null values
nyc_realestate_df = nyc_realestate_df.drop(labels="EASE-MENT", axis=1)
nyc_realestate_df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'BUILDING_CLASS_AT_PRESENT',
       'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE', 'RESIDENTIAL_UNITS',
       'COMMERCIAL_UNITS', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE',
       'BUILDING_CLASS_AT_TIME_OF_SALE', 'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [606]:
# Target dtypes for ZIP_CODE, LAND_SQUARE_FEET, and SALE_DATE
# (LAND_SQUARE_FEET has to be cleaned before it can be cast)
new_types = {'ZIP_CODE' : 'category', 'LAND_SQUARE_FEET': 'int32',
             'SALE_DATE':"datetime64"}

In [607]:
# Comparing with `== np.NaN` never matches, so test for nulls with notnull() instead.
# Despite its name, drop_cond is True for the rows to KEEP (non-null LAND_SQUARE_FEET).
drop_cond = nyc_realestate_df["LAND_SQUARE_FEET"].notnull()

In [608]:
drop_cond

0        True
1        True
2        True
3        True
4        True
         ... 
23035    True
23036    True
23037    True
23038    True
23039    True
Name: LAND_SQUARE_FEET, Length: 23040, dtype: bool
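
The commented-out `== np.NaN` attempt above fails for a reason worth seeing once. The sketch below uses a toy Series (not the real column) to show that an equality test against NaN is always False, while `isnull()` and `notnull()` find the missing entries.

```python
import numpy as np
import pandas as pd

s = pd.Series(["1,000", np.nan, "2,500"])  # toy column, not the real data

eq_mask = s == np.nan     # NaN never compares equal to anything, even itself
null_mask = s.isnull()    # True exactly where the value is missing

print(eq_mask.tolist())
print(null_mask.tolist())

kept = s[s.notnull()]     # same rows as s[s.isnull() == False]
print(kept.tolist())
```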

In [609]:
nyc_realestate_df = nyc_realestate_df[drop_cond]
nyc_realestate_df.isnull().sum()

BOROUGH                               0
NEIGHBORHOOD                          0
BUILDING_CLASS_CATEGORY               0
TAX_CLASS_AT_PRESENT                  0
BLOCK                                 0
LOT                                   0
BUILDING_CLASS_AT_PRESENT             0
ADDRESS                               0
APARTMENT_NUMBER                  17838
ZIP_CODE                              0
RESIDENTIAL_UNITS                     0
COMMERCIAL_UNITS                      0
TOTAL_UNITS                           0
LAND_SQUARE_FEET                      0
GROSS_SQUARE_FEET                     0
YEAR_BUILT                           34
TAX_CLASS_AT_TIME_OF_SALE             0
BUILDING_CLASS_AT_TIME_OF_SALE        0
SALE_PRICE                            0
SALE_DATE                             0
dtype: int64

In [610]:
nyc_realestate_df = nyc_realestate_df.astype({'ZIP_CODE': 'category',
             'SALE_DATE': "datetime64[ns]"})  # newer pandas requires an explicit unit
nyc_realestate_df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'BUILDING_CLASS_AT_PRESENT',
       'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE', 'RESIDENTIAL_UNITS',
       'COMMERCIAL_UNITS', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE',
       'BUILDING_CLASS_AT_TIME_OF_SALE', 'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [611]:
nyc_realestate_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22987 entries, 0 to 23039
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   BOROUGH                         22987 non-null  object        
 1   NEIGHBORHOOD                    22987 non-null  object        
 2   BUILDING_CLASS_CATEGORY         22987 non-null  object        
 3   TAX_CLASS_AT_PRESENT            22987 non-null  object        
 4   BLOCK                           22987 non-null  int64         
 5   LOT                             22987 non-null  int64         
 6   BUILDING_CLASS_AT_PRESENT       22987 non-null  object        
 7   ADDRESS                         22987 non-null  object        
 8   APARTMENT_NUMBER                5149 non-null   object        
 9   ZIP_CODE                        22987 non-null  category      
 10  RESIDENTIAL_UNITS               22987 non-null  float64       
 11  CO

In [612]:
# I dropped the NaN rows from LAND_SQUARE_FEET but still can't convert it
# to a numeric dtype; the values are strings, so they need cleaning first.
    

In [613]:
nyc_realestate_df['LAND_SQUARE_FEET'] = nyc_realestate_df["LAND_SQUARE_FEET"].str.strip()

In [614]:
nyc_realestate_df["LAND_SQUARE_FEET"].value_counts()


0        7500
2,000    1106
2,500    1045
4,000     876
3,000     369
         ... 
4,013       1
5,036       1
3,969       1
2,537       1
5,085       1
Name: LAND_SQUARE_FEET, Length: 3652, dtype: int64

In [615]:
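# Caution: the new DataFrame has a fresh 0..n-1 index, so this assignment aligns on
# index labels rather than row position: values shift after the first dropped row
# and the rows with the largest labels end up as NaN.
nyc_realestate_df['LAND_SQUARE_FEET'] = pd.DataFrame({"testing": [x.replace(",", "") for x in nyc_realestate_df["LAND_SQUARE_FEET"].str.strip()]})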
nyc_realestate_df['LAND_SQUARE_FEET'] 

0        10733
1         2962
2         2074
3            0
4            0
         ...  
23035      NaN
23036      NaN
23037      NaN
23038      NaN
23039      NaN
Name: LAND_SQUARE_FEET, Length: 22987, dtype: object
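
The NaN values at the end of the output above come from index alignment rather than from the data itself. A small sketch on a toy frame (hypothetical values, with a gap in the index like the one left by dropping rows) shows the effect and one way to avoid it by working on the existing Series.

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after rows are filtered out.
df = pd.DataFrame({"LAND_SQUARE_FEET": ["1,000", "2,500", "4,000"]}, index=[0, 2, 3])

cleaned_list = [x.replace(",", "") for x in df["LAND_SQUARE_FEET"]]

# A new Series/DataFrame gets a fresh 0..n-1 index, so assignment aligns on labels.
misaligned = pd.Series(cleaned_list)   # index 0, 1, 2
df["via_new_object"] = misaligned      # label 2 gets the wrong row, label 3 gets NaN

# Working on the existing Series keeps the original index, so rows stay aligned.
df["via_str_methods"] = df["LAND_SQUARE_FEET"].str.replace(",", "", regex=False)
print(df)
```

In the toy frame, the row labeled 2 receives the value that belongs to the row labeled 3, which is the same kind of silent shift that can happen in the real column.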

In [616]:
# nyc_realestate_df['LAND_SQUARE_FEET'] =nyc_realestate_df['LAND_SQUARE_FEET'].astype("float")

In [617]:
error = nyc_realestate_df['LAND_SQUARE_FEET'] == '########'

In [618]:
nyc_realestate_df[error]

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
18478,1,UPPER EAST SIDE (59-79),10 COOPS - ELEVATOR APARTMENTS,2,1376,51,D4,"30 EAST 62ND STREET, 9C",,10065.0,0.0,0.0,0.0,########,0.0,1958.0,2,D4,2225000,2019-04-02


In [619]:
# Drop the row flagged above (index label 18478) with the mask instead of a hard-coded label
nyc_realestate_df = nyc_realestate_df[~error]

In [620]:
nyc_realestate_df['LAND_SQUARE_FEET'] = nyc_realestate_df['LAND_SQUARE_FEET'].astype("float")
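
As an alternative to hunting for placeholder strings such as `'########'` by hand, `pd.to_numeric` with `errors="coerce"` turns anything unparseable into NaN, which can then be inspected or dropped. The Series below is a toy example, not the real column.

```python
import pandas as pd

raw = pd.Series(["10,733", " 2,962 ", "0", "########"], index=[0, 1, 3, 7])

cleaned = raw.str.strip().str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")  # unparseable text becomes NaN

bad_rows = numeric[numeric.isnull()].index.tolist()
print(numeric)
print("Rows that failed to parse:", bad_rows)
```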

In [621]:
nyc_realestate_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22986 entries, 0 to 23039
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   BOROUGH                         22986 non-null  object        
 1   NEIGHBORHOOD                    22986 non-null  object        
 2   BUILDING_CLASS_CATEGORY         22986 non-null  object        
 3   TAX_CLASS_AT_PRESENT            22986 non-null  object        
 4   BLOCK                           22986 non-null  int64         
 5   LOT                             22986 non-null  int64         
 6   BUILDING_CLASS_AT_PRESENT       22986 non-null  object        
 7   ADDRESS                         22986 non-null  object        
 8   APARTMENT_NUMBER                5149 non-null   object        
 9   ZIP_CODE                        22986 non-null  category      
 10  RESIDENTIAL_UNITS               22986 non-null  float64       
 11  CO

### Create One Family Dwelling DF, w/ Sale Price between 100k and 2mil

In [622]:
# subset for only the one family dwelling
single_fam_cond = nyc_realestate_df['BUILDING_CLASS_CATEGORY'] == "01 ONE FAMILY DWELLINGS"
single_fam_cond

0        False
1        False
2        False
3        False
4        False
         ...  
23035     True
23036    False
23037    False
23038    False
23039    False
Name: BUILDING_CLASS_CATEGORY, Length: 22986, dtype: bool

In [623]:
nyc_single_fam_df = nyc_realestate_df[single_fam_cond]
nyc_single_fam_df.shape

(5061, 20)

In [624]:
nyc_single_fam_df["SALE_PRICE"].dtype

dtype('int64')

In [625]:
price_cond =  (nyc_single_fam_df["SALE_PRICE"] > 100000) & (nyc_single_fam_df["SALE_PRICE"] < 2000000)
price_cond

7        False
8        False
9        False
42       False
44        True
         ...  
23029     True
23031     True
23032     True
23033     True
23035     True
Name: SALE_PRICE, Length: 5061, dtype: bool

In [626]:
nyc_single_fam_df = nyc_single_fam_df[price_cond]

In [627]:
nyc_single_fam_df.shape

(3151, 20)
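
The two filtering steps above can also be combined into a single boolean mask. The sketch below uses a small made-up frame with the same column names; the strict inequalities mirror "more than 100 thousand and less than 2 million".

```python
import pandas as pd

sales = pd.DataFrame({
    "BUILDING_CLASS_CATEGORY": ["01 ONE FAMILY DWELLINGS", "01 ONE FAMILY DWELLINGS",
                                "02 TWO FAMILY DWELLINGS", "01 ONE FAMILY DWELLINGS"],
    "SALE_PRICE": [550_000, 0, 810_000, 2_225_000],
})

mask = (
    (sales["BUILDING_CLASS_CATEGORY"] == "01 ONE FAMILY DWELLINGS")
    & (sales["SALE_PRICE"] > 100_000)
    & (sales["SALE_PRICE"] < 2_000_000)
)
subset = sales[mask].copy()  # .copy() gives an independent frame for later edits
print(subset)
```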

### Let's make a Feature Matrix starting with this.

In [628]:
nyc_single_fam_df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800.0,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000.0,2001.0,1940.0,1,A1,200000,2019-01-01
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500.0,2043.0,1925.0,1,A1,810000,2019-01-02
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000.0,2680.0,1899.0,1,A1,125000,2019-01-02
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710.0,1872.0,1940.0,1,A5,620000,2019-01-02


In [629]:
nyc_single_fam_df.describe()

Unnamed: 0,BLOCK,LOT,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
count,3151.0,3151.0,3151.0,3151.0,3151.0,3135.0,3151.0,3151.0,3151.0,3151.0
mean,6917.976515,75.981593,0.987623,0.015868,1.003491,5376.0689,1470.306887,1943.6947,1.0,628560.1
std,3963.326705,161.089514,0.113414,0.127499,0.171789,21156.50023,586.3392,26.676786,0.0,292990.4
min,21.0,1.0,0.0,0.0,0.0,0.0,0.0,1890.0,1.0,104000.0
25%,4016.0,21.0,1.0,0.0,1.0,1549.0,1144.0,1925.0,1.0,447500.0
50%,6301.0,42.0,1.0,0.0,1.0,2500.0,1360.0,1938.0,1.0,568000.0
75%,10208.5,69.0,1.0,0.0,1.0,4000.0,1683.0,1955.0,1.0,760000.0
max,16350.0,2720.0,2.0,2.0,3.0,529759.0,7875.0,2018.0,1.0,1955000.0


In [630]:
# for the future I will write a function for this... but I am almost done now...

In [631]:
nyc_single_fam_df['TAX_CLASS_AT_TIME_OF_SALE'].nunique()

1

In [632]:
nyc_single_fam_df['TAX_CLASS_AT_TIME_OF_SALE'].value_counts()

1    3151
Name: TAX_CLASS_AT_TIME_OF_SALE, dtype: int64

In [633]:
# TAX_CLASS_AT_TIME_OF_SALE has a single value here, so it carries no information
nyc_single_fam_df = nyc_single_fam_df.drop("TAX_CLASS_AT_TIME_OF_SALE", axis=1)
nyc_single_fam_df.shape

(3151, 19)

In [634]:
nyc_single_fam_df['LOT'].nunique()

332

In [635]:
nyc_single_fam_df['YEAR_BUILT'].nunique()

89

In [636]:
nyc_single_fam_df['BLOCK'].nunique()

2496

In [637]:
nyc_single_fam_df['BLOCK'].value_counts()

16350    21
1272      7
5506      6
6022      6
5735      6
         ..
569       1
4667      1
5730      1
6720      1
6145      1
Name: BLOCK, Length: 2496, dtype: int64

In [638]:
# BLOCK has 2,496 unique values; my gut tells me it is on the edge of getting dropped,
# but I'll leave it for now

In [639]:
nyc_single_fam_df['APARTMENT_NUMBER'].nunique()

1

In [640]:
nyc_single_fam_df["TOTAL_UNITS"].nunique()

4

In [641]:
nyc_single_fam_df["COMMERCIAL_UNITS"].nunique()

3

In [642]:
nyc_single_fam_df["COMMERCIAL_UNITS"].value_counts()

0.0    3102
1.0      48
2.0       1
Name: COMMERCIAL_UNITS, dtype: int64

In [643]:
nyc_single_fam_df["RESIDENTIAL_UNITS"].nunique()

3

In [644]:
nyc_single_fam_df["RESIDENTIAL_UNITS"].value_counts()

1.0    3110
0.0      40
2.0       1
Name: RESIDENTIAL_UNITS, dtype: int64

In [645]:
nyc_single_fam_df = nyc_single_fam_df.drop('APARTMENT_NUMBER', axis=1)
nyc_single_fam_df.head(2)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800.0,1325.0,1930.0,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,80-23 232ND STREET,11427.0,1.0,0.0,1.0,4000.0,2001.0,1940.0,A1,200000,2019-01-01


In [646]:
nyc_single_fam_df.shape

(3151, 18)

In [647]:
nyc_single_fam_df.describe(exclude="number")

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,3151.0,3151,3151,3151.0,3151,3151,3151.0,3151,3151
unique,5.0,7,1,2.0,13,3135,125.0,11,91
top,4.0,OTHER,01 ONE FAMILY DWELLINGS,1.0,A1,125-27 LUCAS STREET,10306.0,A1,2019-01-31 00:00:00
freq,1580.0,2959,3151,3111.0,1185,2,127.0,1186,78
first,,,,,,,,,2019-01-01 00:00:00
last,,,,,,,,,2019-04-30 00:00:00


In [648]:
# of course drop category because the subsetting made it meaningless...
nyc_single_fam_df = nyc_single_fam_df.drop("BUILDING_CLASS_CATEGORY", axis=1)
nyc_single_fam_df.shape

(3151, 17)

In [649]:
nyc_single_fam_df["NEIGHBORHOOD"].value_counts()

OTHER                 2959
FLUSHING-NORTH          97
EAST NEW YORK           31
FOREST HILLS            22
BOROUGH PARK            19
ASTORIA                 14
BEDFORD STUYVESANT       9
Name: NEIGHBORHOOD, dtype: int64

In [650]:
nyc_single_fam_df["TAX_CLASS_AT_PRESENT"].value_counts()


1     3111
1D      40
Name: TAX_CLASS_AT_PRESENT, dtype: int64

In [651]:
nyc_single_fam_df["BOROUGH"].value_counts()

4    1580
5     738
3     537
2     293
1       3
Name: BOROUGH, dtype: int64

In [652]:
nyc_single_fam_df["ZIP_CODE"].value_counts()

10306.0    127
10312.0    124
10314.0    113
11434.0    100
11234.0     90
          ... 
10038.0      0
10037.0      0
10036.0      0
10035.0      0
0.0          0
Name: ZIP_CODE, Length: 184, dtype: int64
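
The zero counts in the output above appear because `ZIP_CODE` is a categorical column: filtering rows does not remove categories that no longer occur. A toy sketch (made-up zip codes) of how the unused categories can be trimmed:

```python
import pandas as pd

zips = pd.Series([10306.0, 10312.0, 10035.0, 10306.0]).astype("category")
subset = zips[zips.isin([10306.0, 10312.0])]   # filter out one zip code

counts_before = subset.value_counts()          # 10035.0 is still listed, with count 0
trimmed = subset.cat.remove_unused_categories()
counts_after = trimmed.value_counts()          # only zip codes that actually occur

print(counts_before)
print(counts_after)
```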

In [653]:
# some zip codes have a count of zero: ZIP_CODE is categorical, so categories
# that no longer occur after filtering are still listed

In [654]:
nyc_single_fam_df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'ZIP_CODE', 'RESIDENTIAL_UNITS',
       'COMMERCIAL_UNITS', 'TOTAL_UNITS', 'LAND_SQUARE_FEET',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

## **The Feature Matrix Plan:**
Keep the feature count below 400 (a curse-of-dimensionality rule of thumb).



## Feature Selection 

#### Numeric

1.   RESIDENTIAL_UNITS
2.   COMMERCIAL_UNITS
3.   GROSS_SQUARE_FEET
4.   YEAR_BUILT


#### Categorical

1.   TAX_CLASS_AT_PRESENT - expands to 2
2.   BOROUGH - expands to 5
3.   NEIGHBORHOOD - expands to 7
4.   ZIP_CODE - expands to 125
5.   BUILDING_CLASS_AT_TIME_OF_SALE - expands to 11
6.   SALE_DATE - used only for the train/test split, then dropped before encoding

---
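
The "expands to" numbers in the plan above can be checked before encoding by counting unique values per categorical column. The sketch below uses a tiny made-up frame and `pd.get_dummies` as a stand-in for the `category_encoders` encoder used later; the column lists are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "GROSS_SQUARE_FEET": [1325.0, 2001.0, 2043.0, 2680.0],
    "BOROUGH": ["3", "4", "2", "3"],
    "TAX_CLASS_AT_PRESENT": ["1", "1", "1D", "1"],
})
numeric_cols = ["GROSS_SQUARE_FEET"]
categorical_cols = ["BOROUGH", "TAX_CLASS_AT_PRESENT"]

# One column per numeric feature plus one per distinct category value.
expected_width = len(numeric_cols) + int(toy[categorical_cols].nunique().sum())

encoded = pd.get_dummies(toy, columns=categorical_cols)
print(expected_width, encoded.shape[1])
```

Counting on the training rows only gives the width the fitted encoder will actually produce, which can be smaller than the count on the full data.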






In [655]:
data = {"RESIDENTIAL_UNITS": nyc_single_fam_df["RESIDENTIAL_UNITS"], 
        "COMMERCIAL_UNITS": nyc_single_fam_df["COMMERCIAL_UNITS"],
        "GROSS_SQUARE_FEET": nyc_single_fam_df["GROSS_SQUARE_FEET"],
        "YEAR_BUILT": nyc_single_fam_df["YEAR_BUILT"],
        "TAX_CLASS_AT_PRESENT": nyc_single_fam_df["TAX_CLASS_AT_PRESENT"],
        "BOROUGH": nyc_single_fam_df["BOROUGH"],
        "NEIGHBORHOOD": nyc_single_fam_df["NEIGHBORHOOD"], 
        "ZIP_CODE": nyc_single_fam_df["ZIP_CODE"],
        "BUILDING_CLASS_AT_TIME_OF_SALE": nyc_single_fam_df["BUILDING_CLASS_AT_TIME_OF_SALE"],
        "SALE_DATE": nyc_single_fam_df["SALE_DATE"]}

In [656]:
data_target = {"SALE_PRICE": nyc_single_fam_df["SALE_PRICE"]}

In [657]:
X_matrix = pd.DataFrame(data)

In [658]:
X_matrix.shape

(3151, 10)

In [659]:
y_target = pd.DataFrame(data_target)

In [660]:
y_target.shape

(3151, 1)

# Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.

In [661]:
# make sure it's only from one year
# nyc_single_fam_df["SALE_DATE"].dt.year.value_counts()
X_matrix["SALE_DATE"].dt.year.value_counts()

2019    3151
Name: SALE_DATE, dtype: int64

In [662]:
# the train / test condition

# cond_jan_mar_train = (nyc_single_fam_df['SALE_DATE'].dt.month >=1) & (nyc_single_fam_df['SALE_DATE'].dt.month <=3)
# cond_jan_mar_train

cond_jan_mar_train = (X_matrix['SALE_DATE'].dt.month >=1) & (X_matrix['SALE_DATE'].dt.month <=3)
cond_jan_mar_train

44        True
61        True
78        True
108       True
111       True
         ...  
23029    False
23031    False
23032    False
23033    False
23035    False
Name: SALE_DATE, Length: 3151, dtype: bool

In [663]:
X_train, y_train = X_matrix.loc[cond_jan_mar_train], y_target.loc[cond_jan_mar_train]
X_test, y_test = X_matrix.loc[~cond_jan_mar_train], y_target.loc[~cond_jan_mar_train]

## Establish a baseline

In [664]:
# Helper that computes a naive (mean) baseline for us
def get_naive_baseline(target_feature):
  guess = target_feature.mean()      # always predict the mean price
  errors = guess - target_feature
  mae = errors.abs().mean()          # local name avoids shadowing sklearn's mean_absolute_error

  naive_baseline = namedtuple("baseline_info", ["guess", "errors", "MAE"])

  return naive_baseline(guess, errors, mae)

In [665]:
nbaseline = get_naive_baseline(y_target["SALE_PRICE"])


In [666]:
print(f"Baseline Price Guess: {nbaseline.guess}")
print(f"Baseline MAE: {nbaseline.MAE}")
print(nbaseline.errors.shape)

Baseline Price Guess: 628560.1126626468
Baseline MAE: 215470.57403809694
(3151,)


In [667]:
# so...
print(f"If we priced every property at ${nbaseline.guess},")
print(f"we would be off by ${nbaseline.MAE} on average.")

If we priced every property at $628560.1126626468,
we would be off by $215470.57403809694 on average.
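
The baseline above is computed on every row, including the April test rows. A stricter variant, sketched here with hypothetical prices, takes the guess from the training rows only and measures the error on the test rows, which makes it directly comparable to a model's test MAE.

```python
import pandas as pd

# Hypothetical prices; in the notebook these would be y_train and y_test.
y_train_toy = pd.Series([400_000, 500_000, 600_000, 700_000])
y_test_toy = pd.Series([450_000, 900_000])

guess = y_train_toy.mean()                             # learned from training rows only
baseline_test_mae = (y_test_toy - guess).abs().mean()  # evaluated on unseen rows
print(f"Guess: ${guess:,.0f}, test MAE: ${baseline_test_mae:,.0f}")
```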


In [668]:
# Drop SALE_DATE: it was only needed for the split, and the encoder and model
# cannot use a raw datetime column

X_train =X_train.drop(labels='SALE_DATE', axis=1)
X_test = X_test.drop(labels='SALE_DATE', axis=1)

## Before we build our model we will do some One Hot Encoding...

In [669]:
X_train.shape

(2507, 9)

In [670]:
import category_encoders as ce

#Instantiate the Transformer
encoder = ce.OneHotEncoder(use_cat_names=True)

# fit and Transform the training data  
XT_train = encoder.fit_transform(X_train)

# only transform the test data (never fit the encoder on it)
XT_test = encoder.transform(X_test)

In [671]:
XT_train.head()

Unnamed: 0,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,ZIP_CODE_11230.0,ZIP_CODE_11427.0,ZIP_CODE_10461.0,ZIP_CODE_11226.0,ZIP_CODE_11203.0,ZIP_CODE_11229.0,ZIP_CODE_11364.0,ZIP_CODE_11373.0,ZIP_CODE_11365.0,ZIP_CODE_11429.0,ZIP_CODE_11414.0,ZIP_CODE_11369.0,ZIP_CODE_11415.0,ZIP_CODE_11413.0,ZIP_CODE_11434.0,ZIP_CODE_11435.0,ZIP_CODE_10312.0,ZIP_CODE_10308.0,ZIP_CODE_10314.0,ZIP_CODE_11236.0,ZIP_CODE_11228.0,ZIP_CODE_11207.0,...,ZIP_CODE_10470.0,ZIP_CODE_11105.0,ZIP_CODE_10475.0,ZIP_CODE_11220.0,ZIP_CODE_11213.0,ZIP_CODE_11214.0,ZIP_CODE_11103.0,ZIP_CODE_11004.0,ZIP_CODE_11217.0,ZIP_CODE_11224.0,ZIP_CODE_10453.0,ZIP_CODE_11232.0,ZIP_CODE_10462.0,ZIP_CODE_11372.0,ZIP_CODE_10464.0,ZIP_CODE_10458.0,ZIP_CODE_11368.0,ZIP_CODE_11416.0,ZIP_CODE_11692.0,ZIP_CODE_10468.0,ZIP_CODE_11225.0,ZIP_CODE_11221.0,ZIP_CODE_11233.0,ZIP_CODE_10456.0,ZIP_CODE_11238.0,ZIP_CODE_10455.0,ZIP_CODE_10460.0,ZIP_CODE_11102.0,ZIP_CODE_10459.0,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
44,1.0,0.0,1325.0,1930.0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
61,1.0,0.0,2001.0,1940.0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
78,1.0,0.0,2043.0,1925.0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
108,1.0,0.0,2680.0,1899.0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
111,1.0,0.0,1872.0,1940.0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


## Build Ridge Regression Model

In [672]:
# Instantiate a Ridge Regression Model object
rig_model_01 = Ridge()

# fit data to model
rig_model_01.fit(XT_train, y_train)



Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [676]:
training_MAE = mean_absolute_error(y_train, rig_model_01.predict(XT_train))
testing_MAE = mean_absolute_error(y_test, rig_model_01.predict(XT_test))


In [677]:
training_MAE

127475.69542123054

In [679]:
testing_MAE


141838.389994029

In [674]:
# R^2 on the training set
rig_model_01.score(XT_train, y_train)

0.6141466103758697

In [None]:
# inspect the fitted coefficients
rig_model_01.coef_

In [698]:
def ridge_regress_predict(XT_train, y_train, XT_val, y_val, alpha):
  """Fit a ridge model on the training data and report train/validation metrics."""

  # The data has already been split and encoded before this function is called.

  # Instantiate the ridge regression model.
  # normalize=True follows the assignment; newer scikit-learn releases removed it,
  # in which case scale the features beforehand instead.
  rig_model = Ridge(alpha=alpha, normalize=True)

  # Fit the model to the training data
  rig_model.fit(XT_train, y_train)

  # Making a prediction is just calling .predict() on encoded features
  training_MAE = mean_absolute_error(y_train, rig_model.predict(XT_train))
  testing_MAE = mean_absolute_error(y_val, rig_model.predict(XT_val))

  # Fitted coefficients and intercept
  coefficient = rig_model.coef_
  intercept = rig_model.intercept_

  # R^2 on the training data
  model_r2 = rig_model.score(XT_train, y_train)

  model_prediction = namedtuple("prediction_info", ["training_MAE", "testing_MAE",
                                                    "model_r2_score","coefficient", 
                                                    "intercept"])
  return model_prediction(training_MAE, testing_MAE, model_r2, coefficient,
                          intercept)

In [743]:
# test this function

prediction_info = ridge_regress_predict(XT_train, y_train, XT_test, y_test, 1)

In [744]:

print(f"Model R^2: {prediction_info.model_r2_score}")

print(f"Coefficient: {prediction_info.coefficient}")
print(f"Intercept: {prediction_info.intercept}")
print(f"Training MAE: {prediction_info.training_MAE}")
print(f"Testing MAE: {prediction_info.testing_MAE}")

Model R^2: 0.5049973971836291
Coefficient: [[ 1.07298028e+04  4.79561258e+04  8.68481105e+01  1.70723901e+02
   1.07298028e+04 -1.07298028e+04  5.61766075e+04  1.27146581e+04
  -4.94468134e+04 -3.28228141e+04  3.67906621e+04 -4.68939949e+04
   5.44384725e+04 -1.04136342e+05 -4.95392367e+04  1.61965902e+05
   1.06616296e+05  9.89909609e+04  1.49193682e+05 -4.60551696e+04
  -3.03628262e+04  1.88328254e+04 -6.23224805e+04  5.88726691e+04
   1.22748037e+05  1.35036144e+05  9.23664488e+04 -6.99955859e+04
  -4.22996794e+04  3.56589699e+04  2.12142334e+05 -7.67837101e+04
  -8.12272675e+04 -1.05806599e+04 -2.12122233e+04 -8.79863278e+03
  -2.80202474e+03 -8.18761574e+04  1.41125553e+05 -3.70844227e+04
   2.43012535e+05  1.36069156e+05 -8.02734388e+03 -8.68663567e+04
   1.54271870e+05 -6.79253929e+04  6.38905704e+04  1.05529969e+05
   9.21927046e+04  3.73794370e+04  1.24656323e+05  8.29036139e+03
  -4.15284365e+04 -2.46774068e+04 -2.15106188e+04 -9.14583812e+04
  -8.15224142e+03 -7.41876262e+03

## So the model makes better predictions than the baseline (the first ridge model's test MAE is about 141,838 vs. a baseline MAE of about 215,471), but I don't know how to actually make a prediction. :(
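
Making a prediction is what the `.predict()` calls inside the MAE computations above already do: pass rows with the same columns the model was fitted on, and get one price per row back. A minimal sketch on made-up numbers (the column values and prices are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import Ridge

train = pd.DataFrame({
    "GROSS_SQUARE_FEET": [1000.0, 1500.0, 2000.0, 2500.0],
    "YEAR_BUILT": [1930.0, 1960.0, 1940.0, 1950.0],
})
prices = pd.Series([400_000, 550_000, 700_000, 850_000])

model = Ridge(alpha=1.0).fit(train, prices)

# A new house must be described with the same columns, in the same order.
new_house = pd.DataFrame({"GROSS_SQUARE_FEET": [1800.0], "YEAR_BUILT": [1945.0]})
predicted_price = model.predict(new_house)[0]

# predict() is the intercept plus the sum of coefficient * feature value.
by_hand = model.intercept_ + (model.coef_ * new_house.iloc[0].to_numpy()).sum()
print(f"Predicted price: ${predicted_price:,.0f}")
```

With one-hot encoded data, the new row has to go through the already fitted encoder's `transform` first so that its columns match the training matrix.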

## SelectKBest Transformer

In [711]:
XT_train.shape

(2507, 151)

In [722]:
#Instantiate the Transformer
encoder = ce.OneHotEncoder(use_cat_names=True)

# fit and Transform the training data  
XT_train = encoder.fit_transform(X_train)

# don't fit the testing data
# XT_test = encoder.transform(X_test)

In [735]:
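# try to select the features that are best

# instantiate, and tell it the number of features to keep
# note: with no score_func, SelectKBest defaults to f_classif, which is meant for
# classification targets; f_regression suits a continuous target like price
skb = SelectKBest(k=140)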

In [736]:
XTT_train = skb.fit_transform(XT_train, y_train)

  y = column_or_1d(y, warn=True)
  f = msb / msw
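
The warnings above are consistent with two things: `y_train` is a one-column DataFrame rather than a 1-D array, and the default score function (`f_classif`) is designed for class labels. A sketch of the regression-oriented setup on synthetic data (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "signal_a": rng.normal(size=200),
    "signal_b": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = 5 * X["signal_a"] - 3 * X["signal_b"] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)   # y is a 1-D Series here

# get_support() returns a boolean mask over the input columns.
selected_names = X.columns[selector.get_support()].tolist()
print(selected_names)
```

In the notebook, passing `y_train["SALE_PRICE"]` instead of the whole `y_train` frame would give the selector a 1-D target in the same way.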


In [737]:
# instantiate the predictor
model = LinearRegression()

# fit on the transformed (selected-feature) training data
model.fit(XTT_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [738]:
# R^2 on the training data
print(model.score(XTT_train, y_train))

0.6165646446828702
