Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
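
The modeling steps in the checklist above can be sketched as a single scikit-learn pipeline. This is a minimal sketch on synthetic data, not the NYC dataset: the array names, the 150/50 split, `k=4`, and `alpha=1.0` are all assumptions chosen for illustration. It uses `StandardScaler` for the feature-scaling option rather than `normalize=True`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an already-encoded feature matrix (not the NYC data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical split: first 150 rows to train, last 50 rows to test
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# The scaler and selector are fit on the training rows only;
# the pipeline reuses the fitted steps when predicting on the test rows.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=4),
    Ridge(alpha=1.0),
)
model.fit(X_tr, y_tr)

test_mae = mean_absolute_error(y_te, model.predict(X_te))
print(f'Test MAE: {test_mae:.3f}')
```

Wrapping the steps in a pipeline is one way to make sure `fit_transform` only ever sees the train set, as the checklist asks.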

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
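
For the `RidgeCV` stretch goal, a minimal sketch might look like the following. The synthetic data and the grid of candidate `alphas` are assumptions for illustration; `RidgeCV` picks the regularization strength from that grid by cross-validation and exposes the choice as `alpha_`.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data (not the NYC dataset): two of the five features are pure noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 5))
true_coefs = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=120)

# Hypothetical grid of regularization strengths to compare
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_demo, y_demo)

print('Chosen alpha:', ridge_cv.alpha_)
print('Coefficients:', ridge_cv.coef_.round(2))
```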

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

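# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character rather than a regex anchor
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','', regex=False)
    .str.replace('-','', regex=False)
    .str.replace(',','', regex=False)
    .astype(int)
)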
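
As a regular expression, `$` is an end-of-string anchor, so whether a bare `.str.replace('$', '')` strips the dollar sign depends on how the installed pandas version interprets the pattern. Passing `regex=False` makes the literal intent explicit. The sketch below uses made-up price strings; the real column's exact formatting is an assumption.

```python
import pandas as pd

# Hypothetical raw values, loosely modeled on a currency-formatted column
prices = pd.Series(['$ 1,250,000', '$ - 0', '$ 550,000'])

cleaned = (
    prices
    .str.replace('$', '', regex=False)   # literal dollar sign
    .str.replace('-', '', regex=False)   # literal dash
    .str.replace(',', '', regex=False)   # thousands separator
    .str.replace(' ', '', regex=False)   # leftover spaces
    .astype(int)
)
print(cleaned.tolist())
```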

In [3]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [4]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [5]:
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [None]:
# Create the subset: one-family dwellings within the sale-price range

In [116]:
df['BUILDING_CLASS_CATEGORY'].value_counts().head(1)

01 ONE FAMILY DWELLINGS    5061
Name: BUILDING_CLASS_CATEGORY, dtype: int64

In [117]:
df1 = df[df["BUILDING_CLASS_CATEGORY"] == '01 ONE FAMILY DWELLINGS']

In [118]:
df2 = df1[df1["SALE_PRICE"] > 100000]

In [119]:
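# NOTE: the assignment asks for sale prices under 2 million; this cell
# caps at 1 million, which is the subset the outputs below reflect
df2 = df2[df2["SALE_PRICE"] < 1000000]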

In [None]:
# Drop the columns that are entirely NaN

In [120]:
df2 = df2.drop(['EASE-MENT','APARTMENT_NUMBER'], axis=1)

In [33]:
df2.isnull().sum()

BOROUGH                           0
NEIGHBORHOOD                      0
BUILDING_CLASS_CATEGORY           0
TAX_CLASS_AT_PRESENT              0
BLOCK                             0
LOT                               0
BUILDING_CLASS_AT_PRESENT         0
ADDRESS                           0
ZIP_CODE                          0
RESIDENTIAL_UNITS                 0
COMMERCIAL_UNITS                  0
TOTAL_UNITS                       0
LAND_SQUARE_FEET                  0
GROSS_SQUARE_FEET                 0
YEAR_BUILT                        0
TAX_CLASS_AT_TIME_OF_SALE         0
BUILDING_CLASS_AT_TIME_OF_SALE    0
SALE_PRICE                        0
SALE_DATE                         0
dtype: int64

In [34]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2894 entries, 44 to 23035
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   BOROUGH                         2894 non-null   object 
 1   NEIGHBORHOOD                    2894 non-null   object 
 2   BUILDING_CLASS_CATEGORY         2894 non-null   object 
 3   TAX_CLASS_AT_PRESENT            2894 non-null   object 
 4   BLOCK                           2894 non-null   int64  
 5   LOT                             2894 non-null   int64  
 6   BUILDING_CLASS_AT_PRESENT       2894 non-null   object 
 7   ADDRESS                         2894 non-null   object 
 8   ZIP_CODE                        2894 non-null   float64
 9   RESIDENTIAL_UNITS               2894 non-null   float64
 10  COMMERCIAL_UNITS                2894 non-null   float64
 11  TOTAL_UNITS                     2894 non-null   float64
 12  LAND_SQUARE_FEET                

In [None]:
# Create a month column and drop SALE_DATE

In [122]:
df2['SALE_DATE'] = pd.to_datetime(df2['SALE_DATE'], errors='coerce')

In [123]:
df2['month'] = df2['SALE_DATE'].dt.month

In [101]:
df2 = df2.drop(['SALE_DATE'], axis=1)

In [None]:
# Train/test split: January through March to train, April to test

In [124]:
y = df2['SALE_PRICE']
y.shape

(2894,)

In [125]:
X = df2.drop('SALE_PRICE', axis=1)
X.shape

(2894, 19)

In [126]:
X_train = X[X['month'] < 4]
y_train = y[y.index.isin(X_train.index)]

In [127]:
mask = X['month'] < 4

In [128]:
X.query('month < 4')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE,month
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,2019-01-01,1
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,80-23 232ND STREET,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,2019-01-01,1
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,A1,1260 RHINELANDER AVE,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,2019-01-02,1
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,A1,469 E 25TH ST,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,2019-01-02,1
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,A5,5521 WHITTY LANE,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,2019-01-02,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18129,5,OTHER,01 ONE FAMILY DWELLINGS,1,4081,44,A2,10 SEAFOAM STREET,10306.0,1.0,0.0,1.0,2400,921.0,1950.0,1,A2,2019-03-29,3
18130,5,OTHER,01 ONE FAMILY DWELLINGS,1,2373,201,A5,74 MCVEIGH AVE,10314.0,1.0,0.0,1.0,2450,2128.0,1980.0,1,A5,2019-03-29,3
18132,5,OTHER,01 ONE FAMILY DWELLINGS,1,1132,42,A1,479 VILLA AVENUE,10302.0,1.0,0.0,1.0,4361,1807.0,2018.0,1,A1,2019-03-29,3
18134,5,OTHER,01 ONE FAMILY DWELLINGS,1,3395,37,A2,63 NUGENT AVENUE,10305.0,1.0,0.0,1.0,6000,621.0,1930.0,1,A2,2019-03-29,3


In [129]:
X_test = X[X['month'] >= 4]
y_test = y[y.index.isin(X_test.index)]
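
The same time-based split can be written with one boolean mask applied to both `X` and `y`, which keeps features and target aligned by construction. This is a minimal sketch on a toy frame; the four rows and the `cutoff` date are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are invented)
toy = pd.DataFrame(
    {
        'SALE_DATE': pd.to_datetime(
            ['2019-01-15', '2019-03-30', '2019-04-02', '2019-04-20']
        ),
        'GROSS_SQUARE_FEET': [1200.0, 1800.0, 1500.0, 2100.0],
        'SALE_PRICE': [450000, 620000, 510000, 700000],
    },
    index=[44, 61, 18235, 23035],
)

# Everything before April 1 goes to train, the rest to test
cutoff = pd.Timestamp('2019-04-01')
train_mask = toy['SALE_DATE'] < cutoff

X_toy = toy.drop(columns=['SALE_PRICE', 'SALE_DATE'])
y_toy = toy['SALE_PRICE']

# Apply the same mask to features and target so their indexes always match
X_tr, X_te = X_toy[train_mask], X_toy[~train_mask]
y_tr, y_te = y_toy[train_mask], y_toy[~train_mask]

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```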

In [130]:
# Get mean baseline
y_train.mean()

559717.6594454073

In [131]:
# Train Error
from sklearn.metrics import mean_absolute_error

y_pred = [y_train.mean()] * len(y_train)

In [132]:
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (01-03 salesprice): {mae:.2f} total dollars')

Train Error (01-03 salesprice): 158708.29 total dollars


In [133]:
# Test Error
y_pred = [y_train.mean()] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (04 salesprice): {mae:.2f} total dollars')

Test Error (04 salesprice): 155685.48 total dollars
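
The mean baseline above can also be expressed with scikit-learn's `DummyRegressor`, which gives the baseline the same `fit`/`predict` interface as a real model. The prices below are made up for illustration, and the placeholder feature matrices are an assumption (the dummy model ignores them).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Invented sale prices, not the NYC data
y_tr = np.array([200000, 400000, 600000, 800000])
y_te = np.array([300000, 700000])

# DummyRegressor ignores the features, so placeholders are enough here
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

# Always predicts the mean of the training target
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, baseline.predict(X_tr))
test_mae = mean_absolute_error(y_te, baseline.predict(X_te))
print(train_mae, test_mae)
```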


In [113]:
import pandas as pd
import plotly.express as px

px.scatter(
    X_train,
    x='GROSS_SQUARE_FEET',
    y=y_train,
    text='month',
    trendline='ols',  # Ordinary Least Squares
)

In [135]:
X_train.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE,month
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,2019-01-01,1
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,A1,80-23 232ND STREET,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,2019-01-01,1
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,A1,1260 RHINELANDER AVE,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,2019-01-02,1
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,A1,469 E 25TH ST,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,2019-01-02,1
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,A5,5521 WHITTY LANE,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,2019-01-02,1


In [137]:
y_train

44       550000
61       200000
78       810000
108      125000
111      620000
          ...  
18129    330000
18130    690000
18132    610949
18134    520000
18147    104000
Name: SALE_PRICE, Length: 2308, dtype: int64

In [136]:
X_train['NEIGHBORHOOD'].value_counts().head()

OTHER                 2210
FLUSHING-NORTH          70
BOROUGH PARK            10
FOREST HILLS             8
BEDFORD STUYVESANT       6
Name: NEIGHBORHOOD, dtype: int64

In [138]:
X_train[['NEIGHBORHOOD']].head()

Unnamed: 0,NEIGHBORHOOD
44,OTHER
61,OTHER
78,OTHER
108,OTHER
111,OTHER


In [143]:
# Import the class
from sklearn.preprocessing import OneHotEncoder

# Instantiate
ohe = OneHotEncoder(sparse=False)

# Fit the transformer to the training data only
ohe.fit(X_train[['NEIGHBORHOOD']])

# Transform the training data
train_trans = ohe.transform(X_train[['NEIGHBORHOOD']])

# Do not refit the transformer on the test data; only call .transform on it

In [144]:
train_trans

array([[0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1.]])
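
A minimal sketch of the fit-on-train, transform-on-test pattern for one-hot encoding follows. The two toy frames are invented, and `handle_unknown='ignore'` is an assumption about how to treat a neighborhood that appears only in the test set: it is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented rows; 'FOREST HILLS' appears only in the test frame
train = pd.DataFrame(
    {'NEIGHBORHOOD': ['OTHER', 'FLUSHING-NORTH', 'OTHER', 'BOROUGH PARK']}
)
test = pd.DataFrame({'NEIGHBORHOOD': ['OTHER', 'FOREST HILLS']})

# Learn the categories from the train set only
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train).toarray()

# Reuse the fitted encoder on the test set (no refitting)
test_encoded = encoder.transform(test).toarray()

print(list(encoder.categories_[0]))
print(test_encoded)
```

Calling `.toarray()` on the default sparse output avoids depending on the name of the dense-output parameter, which differs between scikit-learn versions.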

In [158]:
X.shape

(2894, 19)

In [159]:
# How many features do we have currently?
features = X_train.columns
n = len(features)
n

20

In [160]:
# How many ways to choose 1 to n features?
from math import factorial

def n_choose_k(n, k):
    return factorial(n)/(factorial(k)*factorial(n-k))

combinations = sum(n_choose_k(n,k) for k in range(1,n+1))
print(f'{combinations:,.0f}')

1,048,575
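
The count above is the number of non-empty subsets of `n` features, which equals `2**n - 1`. A quick check with the standard library's `math.comb` (Python 3.8+), using the same `n = 20`:

```python
from math import comb

n = 20

# Sum of "n choose k" over every subset size from 1 to n
total = sum(comb(n, k) for k in range(1, n + 1))

# Every feature is either in or out (2**n subsets), minus the empty subset
closed_form = 2**n - 1

print(f'{total:,}')
print(total == closed_form)
```

Using integer arithmetic throughout avoids the float division in `factorial(n)/(...)`, which can lose precision for larger `n`.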


In [161]:
y_train

44       550000
61       200000
78       810000
108      125000
111      620000
          ...  
18129    330000
18130    690000
18132    610949
18134    520000
18147    104000
Name: SALE_PRICE, Length: 2308, dtype: int64

In [174]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 586 entries, 18235 to 23035
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   BLOCK                      586 non-null    int64  
 1   LOT                        586 non-null    int64  
 2   ZIP_CODE                   586 non-null    float64
 3   RESIDENTIAL_UNITS          586 non-null    float64
 4   COMMERCIAL_UNITS           586 non-null    float64
 5   TOTAL_UNITS                586 non-null    float64
 6   GROSS_SQUARE_FEET          586 non-null    float64
 7   YEAR_BUILT                 586 non-null    float64
 8   TAX_CLASS_AT_TIME_OF_SALE  586 non-null    int64  
 9   month                      586 non-null    int64  
 10  perk_count                 586 non-null    float64
dtypes: float64(7), int64(4)
memory usage: 54.9 KB


In [185]:
586 + 2308

2894

In [186]:
y.shape

(2894,)

In [192]:
df2['SALE_PRICE']

44       550000
61       200000
78       810000
108      125000
111      620000
          ...  
23029    635000
23031    514000
23032    635000
23033    545000
23035    510000
Name: SALE_PRICE, Length: 2894, dtype: int64

In [193]:
y_train = y.loc[X_train.index]
y_test = y.loc[X_test.index]
X_train = X_train.select_dtypes(include='number')
X_test = X_test.select_dtypes(include='number')

In [209]:
X_train.shape

(2308, 11)

In [206]:
# Select the 5 features that best correlate with the target
# (5 is an arbitrary starting point here)

# SelectKBest has a similar API to what we've seen before.
# IMPORTANT!
# .fit_transform on the train set
# .transform on test set

from sklearn.feature_selection import SelectKBest, f_regression

# f_regression scores features against a continuous target;
# the default score function, f_classif, is meant for classification
selector = SelectKBest(score_func=f_regression, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
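
`SelectKBest` needs `X` and `y` to have the same number of rows, and for a continuous target such as sale price the `f_regression` score function is the usual fit. A minimal sketch on synthetic data (the feature names, the 240/60 split, and `k=2` are assumptions) shows how to read back which columns were kept:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only feature_0 and feature_3 drive the target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 6)),
    columns=[f'feature_{i}' for i in range(6)],
)
y_demo = (
    5.0 * X_demo['feature_0']
    + 3.0 * X_demo['feature_3']
    + rng.normal(scale=0.5, size=300)
)

# Hypothetical split; y is sliced with the same index as X
X_tr, X_te = X_demo.iloc[:240], X_demo.iloc[240:]
y_tr = y_demo.loc[X_tr.index]

# Fit the selector on the train set, then reuse it on the test set
selector = SelectKBest(score_func=f_regression, k=2)
X_tr_selected = selector.fit_transform(X_tr, y_tr)
X_te_selected = selector.transform(X_te)

# get_support() returns a boolean mask over the original columns
selected_names = X_tr.columns[selector.get_support()].tolist()
print(selected_names)
```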

In [147]:
df2.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'BUILDING_CLASS_AT_PRESENT',
       'ADDRESS', 'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
       'TOTAL_UNITS', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE', 'month'],
      dtype='object')

In [148]:
df2.head(1)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE,month
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01,1


In [150]:
def engineer_features(X):
    
    # Avoid SettingWithCopyWarning
    X = X.copy()
    
    # Sum the numeric unit and tax-class columns into one feature.
    # TAX_CLASS_AT_PRESENT is a string column, so it is left out of the sum.
    # (The name 'perk_count' is carried over from the apartment-rent notebook.)
    perk_cols = ['RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
                 'TOTAL_UNITS', 'TAX_CLASS_AT_TIME_OF_SALE']
    X['perk_count'] = X[perk_cols].sum(axis=1)

    return X

X_train = engineer_features(X_train)
X_test = engineer_features(X_test)