Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do a train/test split. Use data from January through March 2019 to train, and data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.
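
The scaling route mentioned in the ridge item above can be sketched as follows. This is a minimal illustration on synthetic data, not the assignment solution; the arrays `X_train_demo` and `X_test_demo`, the coefficients, and the `alpha` value are made up for the example. Note that `normalize=True` was removed from scikit-learn's linear models in version 1.2, so on current versions the scaler route is the one that runs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 3 features on very different scales
scales = np.array([1.0, 100.0, 10_000.0])
X_train_demo = rng.normal(size=(200, 3)) * scales
X_test_demo = rng.normal(size=(50, 3)) * scales
true_coefs = np.array([5.0, 0.05, 0.0005])
y_train_demo = X_train_demo @ true_coefs + rng.normal(size=200)
y_test_demo = X_test_demo @ true_coefs + rng.normal(size=50)

# Fit the scaler on the training set only, then reuse it for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

# Ridge on the scaled features (alpha=1.0 is an arbitrary choice here)
ridge_demo = Ridge(alpha=1.0)
ridge_demo.fit(X_train_scaled, y_train_demo)
test_mae_demo = mean_absolute_error(y_test_demo, ridge_demo.predict(X_test_scaled))
print("Test MAE:", test_mae_demo)
```

Calling `transform` (not `fit_transform`) on the test set keeps the test data from influencing the scaling statistics.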

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you're interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [11]:
import numpy as np
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv', parse_dates=[-1], index_col=[-1])

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

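# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character, not as a regex end-of-string anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)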

In [12]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [13]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [14]:
df.head()

SALE DATE,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE
2019-01-01,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0
2019-01-01,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0
2019-01-01,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0
2019-01-01,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0
2019-01-01,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0


In [15]:
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE'],
      dtype='object')

In [17]:
df["BUILDING_CLASS_CATEGORY"].describe()

count                       23040
unique                         44
top       01 ONE FAMILY DWELLINGS
freq                         5061
Name: BUILDING_CLASS_CATEGORY, dtype: object

In [18]:
len(df)

23040

In [19]:
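# Subset data. Both price bounds must hold, so join them with `&`; an `|` between them
# matches every row (the counts and metrics recorded in this notebook come from that unfiltered run).
df = df[(df["BUILDING_CLASS_CATEGORY"] == "01 ONE FAMILY DWELLINGS")
        & (df["SALE_PRICE"] > 100000) & (df["SALE_PRICE"] < 2000000)]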

In [20]:
len(df)

5061
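
A quick check on a toy frame shows why the two price bounds have to be combined with `&`: every price is either above 100,000 or below 2,000,000, so an `|` between the bounds keeps all rows. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: prices below, inside, and above the target range
toy = pd.DataFrame({"SALE_PRICE": [0, 50_000, 150_000, 900_000, 1_999_999, 2_500_000]})

above_min = toy["SALE_PRICE"] > 100_000
below_max = toy["SALE_PRICE"] < 2_000_000

kept_with_or = toy[above_min | below_max]    # every row satisfies at least one bound
kept_with_and = toy[above_min & below_max]   # only rows strictly inside the range

print(len(kept_with_or), len(kept_with_and))
```

Comparing `len(df)` before and after the filter, as the cells here do, is a cheap way to notice when a mask removes nothing.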

In [24]:
df.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
dtype: object

In [48]:
df['ZIP_CODE'].value_counts()

10312.0    180
10306.0    170
10314.0    167
11234.0    146
11434.0    146
          ... 
11106.0      1
10026.0      1
10027.0      1
11102.0      1
0.0          1
Name: ZIP_CODE, Length: 151, dtype: int64

`ZIP_CODE` should really be categorical (an object column) rather than a float, but it will be dropped anyway, so there is no need to convert it.

In [50]:
df['EASE-MENT'].value_counts(dropna=False)

NaN    5061
Name: EASE-MENT, dtype: int64

`EASE-MENT` is NaN in all 5061 rows, so drop it as well.

In [51]:
df['BLOCK'].value_counts(dropna=False)

16350    23
16340     8
6022      7
1272      7
5514      7
         ..
902       1
4996      1
898       1
7041      1
6151      1
Name: BLOCK, Length: 3658, dtype: int64

In [52]:
df['LOT'].value_counts(dropna=False)

20      89
1       87
19      82
21      75
31      75
        ..
2720     1
657      1
621      1
1139     1
593      1
Name: LOT, Length: 389, dtype: int64

`BLOCK` and `LOT` are identifiers rather than quantities, but both will be dropped, so there is no need to convert them first.

In [26]:
df.select_dtypes("object").head()

SALE DATE,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE
2019-01-01,2,OTHER,01 ONE FAMILY DWELLINGS,1,A1,1193 SACKET AVENUE,,3404,A1
2019-01-01,2,OTHER,01 ONE FAMILY DWELLINGS,1,A5,1215 VAN NEST AVENUE,,2042,A5
2019-01-01,2,OTHER,01 ONE FAMILY DWELLINGS,1,A5,1211 VAN NEST AVENUE,,2042,A5
2019-01-01,3,OTHER,01 ONE FAMILY DWELLINGS,1,A1,2601 AVENUE R,,3333,A1
2019-01-01,3,OTHER,01 ONE FAMILY DWELLINGS,1,A9,4832 BAY PARKWAY,,6800,A9


In [27]:
df.select_dtypes("object").nunique()

BOROUGH                              5
NEIGHBORHOOD                         9
BUILDING_CLASS_CATEGORY              1
TAX_CLASS_AT_PRESENT                 3
BUILDING_CLASS_AT_PRESENT           16
ADDRESS                           5001
APARTMENT_NUMBER                     2
LAND_SQUARE_FEET                  1469
BUILDING_CLASS_AT_TIME_OF_SALE      12
dtype: int64

In [30]:
df["NEIGHBORHOOD"].value_counts()

OTHER                      4727
FLUSHING-NORTH              186
FOREST HILLS                 50
BOROUGH PARK                 43
ASTORIA                      24
BEDFORD STUYVESANT           17
UPPER EAST SIDE (59-79)       9
UPPER EAST SIDE (79-96)       3
UPPER WEST SIDE (79-96)       2
Name: NEIGHBORHOOD, dtype: int64

In [32]:
df["APARTMENT_NUMBER"].value_counts(dropna=False)

NaN    5059
8         1
RP.       1
Name: APARTMENT_NUMBER, dtype: int64

`APARTMENT_NUMBER` is NaN in 5059 of 5061 rows, so I'm going to drop it.

In [33]:
df['BUILDING_CLASS_AT_PRESENT'].value_counts()

A1    1930
A5    1508
A2     788
A9     364
A0     124
S1     108
A3      99
A4      60
A8      46
A6      20
A7       5
S0       3
B2       2
V0       2
Z0       1
B3       1
Name: BUILDING_CLASS_AT_PRESENT, dtype: int64

In [None]:
# Drop building class columns and convert land square feet
df['LAND_SQUARE_FEET'] = df['LAND_SQUARE_FEET'].str.replace(",", "").astype(int)
df = df.drop(["BUILDING_CLASS_AT_PRESENT", "BUILDING_CLASS_AT_TIME_OF_SALE"], axis=1)



In [38]:
df.select_dtypes("object").nunique()

BOROUGH                       5
NEIGHBORHOOD                  9
BUILDING_CLASS_CATEGORY       1
TAX_CLASS_AT_PRESENT          3
ADDRESS                    5001
APARTMENT_NUMBER              2
dtype: int64

In [39]:
# Drop address and apartments as well
df = df.drop(["ADDRESS", "APARTMENT_NUMBER"], axis=1)

In [40]:
df.select_dtypes("object").nunique()

BOROUGH                    5
NEIGHBORHOOD               9
BUILDING_CLASS_CATEGORY    1
TAX_CLASS_AT_PRESENT       3
dtype: int64

The remaining categorical variables all have low cardinality, so they are fine to one-hot encode.

In [53]:
# Drop the other wrong format columns
df = df.drop(['BLOCK', 'LOT', 'EASE-MENT', 'ZIP_CODE'], axis=1)

In [54]:
target = "SALE_PRICE"
y = df[target]
X = df.drop(target, axis=1)

In [55]:
cutoff = "2019-04-01"
mask = X.index < cutoff

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

In [70]:
assert len(X) == len(X_test) + len(X_train)

In [56]:
y_train.mean()

457941.81631656084

In [57]:
from sklearn.metrics import mean_absolute_error

In [58]:
# Baseline: MAE when always predicting the training-set mean
mean_absolute_error(y_train, [y_train.mean()] * len(y_train))

379123.14777138806
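
The baseline above is the MAE of always predicting the training mean, that is, the average of |y_i − ȳ| over the rows. A small sketch with made-up prices (the array `y_demo` is hypothetical) computes the same quantity directly with NumPy:

```python
import numpy as np

# Hypothetical sale prices
y_demo = np.array([200_000.0, 400_000.0, 600_000.0, 1_200_000.0])

# Predict the mean for every row, then average the absolute errors
mean_prediction = y_demo.mean()
baseline_mae = np.abs(y_demo - mean_prediction).mean()

print(mean_prediction, baseline_mae)
```

Any model worth keeping should beat this number on the same data.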

In [59]:
len(X_train.columns)

11

In [61]:
X_train.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
       'TOTAL_UNITS', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE'],
      dtype='object')

In [67]:
# Encoding
from category_encoders import OneHotEncoder

encode = OneHotEncoder(use_cat_names=True)

XT_train = encode.fit_transform(X_train)


In [68]:
len(XT_train.columns)

25

In [69]:
XT_train.columns

Index(['BOROUGH_2', 'BOROUGH_3', 'BOROUGH_4', 'BOROUGH_5', 'BOROUGH_1',
       'NEIGHBORHOOD_OTHER', 'NEIGHBORHOOD_FLUSHING-NORTH',
       'NEIGHBORHOOD_BOROUGH PARK', 'NEIGHBORHOOD_UPPER EAST SIDE (59-79)',
       'NEIGHBORHOOD_BEDFORD STUYVESANT', 'NEIGHBORHOOD_FOREST HILLS',
       'NEIGHBORHOOD_ASTORIA', 'NEIGHBORHOOD_UPPER EAST SIDE (79-96)',
       'NEIGHBORHOOD_UPPER WEST SIDE (79-96)',
       'BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS',
       'TAX_CLASS_AT_PRESENT_1', 'TAX_CLASS_AT_PRESENT_1D',
       'TAX_CLASS_AT_PRESENT_1B', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
       'TOTAL_UNITS', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE'],
      dtype='object')

In [71]:
XT_test = encode.transform(X_test)
len(XT_test.columns)

25

In [73]:
# Not advisable (it tunes k against the test set), but loop over k for SelectKBest,
# fit a plain linear regression on the training data, and compare test MAE for each k

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression

for i in range(1, len(XT_train.columns) + 1):
  print(f"{i} Features")

  selector = SelectKBest(score_func = f_regression, k=i)
  XT_train_selected = selector.fit_transform(XT_train, y_train)
  XT_test_selected = selector.transform(XT_test)

  lin_reg = LinearRegression()
  lin_reg.fit(XT_train_selected, y_train)
  print(f"Test Mean Absolute Error: {mean_absolute_error(y_test, lin_reg.predict(XT_test_selected)).round(2)}")

1 Features
Test Mean Absolute Error: 389408.89
2 Features
Test Mean Absolute Error: 387725.25
3 Features
Test Mean Absolute Error: 377098.73
4 Features
Test Mean Absolute Error: 375334.94
5 Features
Test Mean Absolute Error: 375819.6
6 Features
Test Mean Absolute Error: 375786.94
7 Features
Test Mean Absolute Error: 375739.35
8 Features
Test Mean Absolute Error: 374711.55
9 Features
Test Mean Absolute Error: 375782.86
10 Features
Test Mean Absolute Error: 376005.68
11 Features
Test Mean Absolute Error: 374233.3
12 Features
Test Mean Absolute Error: 374232.01
13 Features
Test Mean Absolute Error: 374312.9
14 Features
Test Mean Absolute Error: 374400.58
15 Features
Test Mean Absolute Error: 374400.58
16 Features
Test Mean Absolute Error: 374436.83
17 Features
Test Mean Absolute Error: 374436.83
18 Features
Test Mean Absolute Error: 374392.95
19 Features
Test Mean Absolute Error: 374392.95
20 Features
Test Mean Absolute Error: 374057.6
21 Features
Test Mean Absolute Error: 374312.11
22 Features
Test Mean Absolute Error: 374731.35
23 Features
Test Mean Absolute Error: 374731.35
24 Features
Test Mean Absolute Error: 374731.35
25 Features
Test Mean Absolute Error: 374731.35




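The test MAE is nearly flat from about 11 features onward (roughly 374,000 to 374,700). The lowest value is at k=20 (374057.6), but the differences are small, so I'm going with k=18.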
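
Picking `k` by comparing test-set errors lets information from the test set leak into the model choice. One alternative, sketched here on synthetic data rather than on the notebook's real frames, is to cross-validate `k` on the training set only with a `Pipeline` and `GridSearchCV`. The data, the step names `select` and `model`, and the grid of `k` values are assumptions made for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)

# Synthetic training data (hypothetical): 10 features, only the first 3 matter
X_demo = rng.normal(size=(300, 10))
y_demo = (4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 2.0 * X_demo[:, 2]
          + rng.normal(size=300))

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])

# Try every k; each candidate is scored by 5-fold cross-validated MAE
search = GridSearchCV(
    pipe,
    param_grid={"select__k": list(range(1, 11))},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["select__k"]
cv_mae = -search.best_score_
print("Best k:", best_k, "CV MAE:", round(cv_mae, 3))
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the held-out fold never influences which features are chosen.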

In [98]:
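# Pass f_regression explicitly: SelectKBest defaults to f_classif, which is meant for
# classification targets (the results recorded below were produced with that default).
selector = SelectKBest(score_func=f_regression, k=18)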

XTT_train = selector.fit_transform(XT_train, y_train)
XTT_test = selector.transform(XT_test)



In [99]:
# Use RidgeCV to find the best regularization strength (alpha, often written lambda) before using Ridge
from sklearn.linear_model import Ridge, RidgeCV

alphas = [0.0001, 56, 79, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]

ridge_cv = RidgeCV(alphas = alphas, normalize = True)
ridge_cv.fit(XTT_train, y_train)
ridge_cv.alpha_

1.0
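
For intuition about what `alpha` controls: ridge adds an L2 penalty `alpha * ||w||^2` to the squared-error loss, which shrinks the coefficients toward zero as `alpha` grows. Below is a minimal NumPy sketch using the closed-form solution `w = (X^T X + alpha * I)^-1 X^T y` on synthetic data with no intercept. The data and the helper name `ridge_coefs` are made up for illustration, and the `alpha` values are not comparable to the ones searched above, since scaling changes what a given `alpha` means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical); no intercept term for simplicity
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)


def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Larger alpha means a stronger penalty and a smaller coefficient norm
for alpha in [0.0, 1.0, 100.0, 10_000.0]:
    w = ridge_coefs(X_demo, y_demo, alpha)
    print(f"alpha={alpha:>8}: coefs={np.round(w, 3)}, norm={np.linalg.norm(w):.3f}")
```

With `alpha=0` this reduces to ordinary least squares, which is why a cross-validated `alpha` is a trade-off between fitting the training data and keeping the coefficients small.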

In [100]:
# RidgeCV chose alpha = 1.0 (also Ridge's default); pass it explicitly
ridge = Ridge(alpha=ridge_cv.alpha_, normalize=True)

ridge.fit(XTT_train, y_train)

print("RIDGE train MAE:", mean_absolute_error(y_train, ridge.predict(XTT_train)))
print("RIDGE test MAE:", mean_absolute_error(y_test, ridge.predict(XTT_test)))

RIDGE train MAE: 370932.9078435294
RIDGE test MAE: 380069.933608718


In [101]:
from sklearn.metrics import r2_score

print("Training R^2:", r2_score(y_train, ridge.predict(XTT_train)))
print("Testing R^2:", r2_score(y_test, ridge.predict(XTT_test)))

Training R^2: 0.34070839246625306
Testing R^2: 0.45758760115578434
