<a href="https://colab.research.google.com/github/EvidenceN/DS-Unit-2-Linear-Models/blob/master/%20module3-ridge-regression/Evidence.N.%20Answers_Assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`). 

Use a subset of the data where the **sale price was more than \$100 thousand and less than \$2 million.**

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.

- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Fit a ridge regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.


## Stretch Goals
- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols (as literal characters, not regex patterns), convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [165]:
# Preview the raw data before filtering by sale price.
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


The objective is to predict the sale price of one-family homes. These are identified by the value `'01 ONE FAMILY DWELLINGS'` in the `BUILDING_CLASS_CATEGORY` column.
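The cells in this notebook filter on sale price only. A minimal sketch of how the one-family restriction from the assignment could be combined with the price filter is shown here. The tiny `sales` frame and its values are invented for illustration; only the column names and the category label come from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the sales table (values invented for illustration).
sales = pd.DataFrame({
    "BUILDING_CLASS_CATEGORY": [
        "01 ONE FAMILY DWELLINGS",
        "10 COOPS - ELEVATOR APARTMENTS",
        "01 ONE FAMILY DWELLINGS",
        "01 ONE FAMILY DWELLINGS",
    ],
    "SALE_PRICE": [550_000, 330_000, 0, 2_500_000],
})

# Keep one-family dwellings priced strictly between $100k and $2M.
is_one_family = sales["BUILDING_CLASS_CATEGORY"] == "01 ONE FAMILY DWELLINGS"
in_price_range = (sales["SALE_PRICE"] > 100_000) & (sales["SALE_PRICE"] < 2_000_000)

# .copy() avoids chained-assignment warnings when columns are modified later.
one_family = sales[is_one_family & in_price_range].copy()
print(one_family)
```

Applying both masks in one step keeps the subset definition in a single place, which makes it easier to check against the assignment text.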

In [166]:
# Create a subset where the sale price was more than $100 thousand and
# less than $2 million.
df = df[(df['SALE_PRICE']>100000) & (df['SALE_PRICE']< 2000000)]
df.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
66,1,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,1347,18,,D4,"345 EAST 54TH ST, 3B",,10022.0,0.0,0.0,0.0,0,0.0,1960.0,2,D4,330000,01/02/2019
67,1,UPPER EAST SIDE (79-96),10 COOPS - ELEVATOR APARTMENTS,2,1491,62,,D4,"16 EAST 80TH STREET, 2A",,10075.0,0.0,0.0,0.0,0,0.0,1925.0,2,D4,600000,01/02/2019
71,1,UPPER WEST SIDE (59-79),13 CONDOS - ELEVATOR APARTMENTS,2,1171,2200,,R4,"240 RIVERSIDE BOULEVARD, 4 F",4 F,10069.0,1.0,0.0,1.0,0,827.0,2004.0,2,R4,1250000,01/02/2019


In [0]:
# Do train/test split. Use data from January — March 2019 to train. 
# Use data from April 2019 to test.

df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)

# Training data: January through March. Test data: April.
train = df[df['SALE_DATE'].dt.month < 4]
test = df[df['SALE_DATE'].dt.month == 4]
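The month-only filter above relies on the file covering a single calendar year. A sketch of the same split with explicit date bounds follows, on made-up rows; the `sales`, `train_demo`, and `test_demo` names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical sale dates, including one from a different year.
sales = pd.DataFrame({
    "SALE_DATE": pd.to_datetime(
        ["2019-01-15", "2019-03-30", "2019-04-02", "2018-04-10"]
    ),
    "SALE_PRICE": [550_000, 200_000, 600_000, 450_000],
})

# Explicit bounds keep rows from other years out of both splits.
train_start = pd.Timestamp("2019-01-01")
cutoff = pd.Timestamp("2019-04-01")
test_end = pd.Timestamp("2019-05-01")

train_mask = (sales["SALE_DATE"] >= train_start) & (sales["SALE_DATE"] < cutoff)
test_mask = (sales["SALE_DATE"] >= cutoff) & (sales["SALE_DATE"] < test_end)

train_demo = sales[train_mask]
test_demo = sales[test_mask]
print(len(train_demo), len(test_demo))
```

With a month-only comparison, the 2018 row would have landed in the test set; the explicit bounds exclude it.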

In [168]:
import pandas_profiling
pandas_profiling.ProfileReport(train)

pandas-profiling report for `train`, condensed from the rendered output (variable names matched to the report blocks by column order):

Overview,Value
Number of variables,22
Number of observations,10442
Total Missing (%),8.2%
Numeric,11
Categorical,9
Date,1
Rejected,1

Variable,Type,Distinct count,Missing (n),Summary
(index),Numeric,10442,0,min 44; median 9120.5; mean 9169; max 18147
BOROUGH,Categorical,5,0,most frequent: 4 (3608); 3 (2774); 1 (1840); 5 (1129); 2 (1091)
NEIGHBORHOOD,Categorical,11,0,most frequent: OTHER (8811; 84.4%)
BUILDING_CLASS_CATEGORY,Categorical,40,0,most frequent: 01 ONE FAMILY DWELLINGS (2507; 24.0%)
TAX_CLASS_AT_PRESENT,Categorical,10,0,most frequent: 1 (4949; 47.4%)
BLOCK,Numeric,5211,0,min 3; median 3900.5; mean 4638.2; max 16350
LOT,Numeric,1021,0,min 1; median 48; mean 336.61; max 9002
EASE-MENT,Rejected,,,constant value
BUILDING_CLASS_AT_PRESENT,Categorical,96,0,most frequent: D4 (2110; 20.2%)
ADDRESS,Categorical,10397,0,
APARTMENT_NUMBER,Categorical,827,8306,79.5% missing
ZIP_CODE,Numeric,182,0,min 0; median 11211; mean 10821; max 11697; zeros 0.6%
RESIDENTIAL_UNITS,Numeric,30,0,min 0; median 1; mean 1.1603; max 155
COMMERCIAL_UNITS,Numeric,16,0,min -1; median 0; mean 0.13733; max 50
TOTAL_UNITS,Numeric,35,0,min 0; median 1; mean 1.4868; max 156
LAND_SQUARE_FEET,Categorical,1994,25,most frequent: 0 (4061; 38.9%)
GROSS_SQUARE_FEET,Numeric,2242,0,min 0; median 1134; mean 1596; max 244620; zeros 29.7%
YEAR_BUILT,Numeric,129,5,min 0; median 1945; mean 1839.8; max 2018; zeros 5.7%
TAX_CLASS_AT_TIME_OF_SALE,Numeric,3,0,values: 1 (5306); 2 (4722); 4 (414)
BUILDING_CLASS_AT_TIME_OF_SALE,Categorical,96,0,most frequent: D4 (2110; 20.2%)
SALE_PRICE,Numeric,2348,0,min 100044; median 625000; mean 707970; max 1999877
SALE_DATE,Date,80,0,2019-01-01 to 2019-03-30

Sample rows:

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-01-01
66,1,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,1347,18,,D4,"345 EAST 54TH ST, 3B",,10022.0,0.0,0.0,0.0,0,0.0,1960.0,2,D4,330000,2019-01-02
67,1,UPPER EAST SIDE (79-96),10 COOPS - ELEVATOR APARTMENTS,2,1491,62,,D4,"16 EAST 80TH STREET, 2A",,10075.0,0.0,0.0,0.0,0,0.0,1925.0,2,D4,600000,2019-01-02
71,1,UPPER WEST SIDE (59-79),13 CONDOS - ELEVATOR APARTMENTS,2,1171,2200,,R4,"240 RIVERSIDE BOULEVARD, 4 F",4 F,10069.0,1.0,0.0,1.0,0,827.0,2004.0,2,R4,1250000,2019-01-02


In [169]:
print(train.shape)
train.dtypes

(10442, 21)


BOROUGH                                   object
NEIGHBORHOOD                              object
BUILDING_CLASS_CATEGORY                   object
TAX_CLASS_AT_PRESENT                      object
BLOCK                                      int64
LOT                                        int64
EASE-MENT                                float64
BUILDING_CLASS_AT_PRESENT                 object
ADDRESS                                   object
APARTMENT_NUMBER                          object
ZIP_CODE                                 float64
RESIDENTIAL_UNITS                        float64
COMMERCIAL_UNITS                         float64
TOTAL_UNITS                              float64
LAND_SQUARE_FEET                          object
GROSS_SQUARE_FEET                        float64
YEAR_BUILT                               float64
TAX_CLASS_AT_TIME_OF_SALE                  int64
BUILDING_CLASS_AT_TIME_OF_SALE            object
SALE_PRICE                                 int64
SALE_DATE           

In [170]:
#  Objective: Do one-hot encoding of categorical features.

# finding high cardinality categories. 

train.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq,first,last
BOROUGH,10442,5,4,3608,NaT,NaT
TAX_CLASS_AT_PRESENT,10442,10,1,4949,NaT,NaT
NEIGHBORHOOD,10442,11,OTHER,8811,NaT,NaT
BUILDING_CLASS_CATEGORY,10442,40,01 ONE FAMILY DWELLINGS,2507,NaT,NaT
SALE_DATE,10442,80,2019-01-31 00:00:00,285,2019-01-01,2019-03-30
BUILDING_CLASS_AT_PRESENT,10442,96,D4,2110,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,10442,96,D4,2110,NaT,NaT
APARTMENT_NUMBER,2136,826,2A,42,NaT,NaT
LAND_SQUARE_FEET,10417,1993,0,4061,NaT,NaT
ADDRESS,10442,10397,N/A BAY STREET,4,NaT,NaT


In [0]:
# defining the target
target = 'SALE_PRICE'

# Exclude high-cardinality categorical features. The limit is set just above
# the 40 distinct values of BUILDING_CLASS_CATEGORY, the feature we care about,
# so that column is kept.

max_cardinality = 41 # High cardinality limit

# Specify high-cardinality columns, counting distinct values in the training set.
high_cardinality = [col for col in train.select_dtypes(exclude='number') if 
                    train[col].nunique() > max_cardinality]
# or high_cardinality = ['list of column names', 'more names']

# Drop the target and the high-cardinality columns.
# Also drop EASE-MENT because it is mostly NaN.
features = train.columns.drop([target] + high_cardinality + ['EASE-MENT'])

# assigning training and testing x and y features
x_train = train[features]
y_train = train[target]
x_test =  test[features]
y_test = test[target]

In [172]:
print(x_train.shape)
x_train.head()
# it dropped 8 columns. 

(10442, 13)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,11230.0,1.0,0.0,1.0,1325.0,1930.0,1
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,11427.0,1.0,0.0,1.0,2001.0,1940.0,1
66,1,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,1347,18,10022.0,0.0,0.0,0.0,0.0,1960.0,2
67,1,UPPER EAST SIDE (79-96),10 COOPS - ELEVATOR APARTMENTS,2,1491,62,10075.0,0.0,0.0,0.0,0.0,1925.0,2
71,1,UPPER WEST SIDE (59-79),13 CONDOS - ELEVATOR APARTMENTS,2,1171,2200,10069.0,1.0,0.0,1.0,827.0,2004.0,2


In [0]:
import category_encoders as ce

# instantiate category encoder to get one hot encoder. 
# link for future reference: https://contrib.scikit-learn.org/categorical-encoding/onehot.html
# category encoder link: https://contrib.scikit-learn.org/categorical-encoding/

encoder = ce.OneHotEncoder(use_cat_names = True)
x_train = encoder.fit_transform(x_train)
x_test = encoder.transform(x_test)
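The encoder is fit on the training rows and then only applied to the test rows, so a category that appears only in the test set needs a defined behavior. The sketch below illustrates that idea with scikit-learn's `OneHotEncoder` rather than `category_encoders`, on invented values; `train_demo`, `test_demo`, and `encoder_demo` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented example: borough "5" appears only in the test rows.
train_demo = pd.DataFrame({"BOROUGH": ["1", "3", "4"]})
test_demo = pd.DataFrame({"BOROUGH": ["3", "5"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder_demo = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder_demo.fit_transform(train_demo).toarray()
test_encoded = encoder_demo.transform(test_demo).toarray()

print(encoder_demo.categories_)
print(test_encoded)
```

The test matrix has the same three columns as the training matrix, which is what the downstream selector and model expect.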

In [174]:
x_train.head()

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_1,BOROUGH_2,BOROUGH_5,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_UPPER EAST SIDE (79-96),NEIGHBORHOOD_UPPER WEST SIDE (59-79),NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_ASTORIA,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_GRAMERCY,NEIGHBORHOOD_UPPER EAST SIDE (59-79),NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_FOREST HILLS,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,BUILDING_CLASS_CATEGORY_10 COOPS - ELEVATOR APARTMENTS,BUILDING_CLASS_CATEGORY_13 CONDOS - ELEVATOR APARTMENTS,BUILDING_CLASS_CATEGORY_02 TWO FAMILY DWELLINGS,BUILDING_CLASS_CATEGORY_03 THREE FAMILY DWELLINGS,BUILDING_CLASS_CATEGORY_41 TAX CLASS 4 - OTHER,BUILDING_CLASS_CATEGORY_08 RENTALS - ELEVATOR APARTMENTS,BUILDING_CLASS_CATEGORY_44 CONDO PARKING,BUILDING_CLASS_CATEGORY_07 RENTALS - WALKUP APARTMENTS,BUILDING_CLASS_CATEGORY_47 CONDO NON-BUSINESS STORAGE,BUILDING_CLASS_CATEGORY_05 TAX CLASS 1 VACANT LAND,BUILDING_CLASS_CATEGORY_43 CONDO OFFICE BUILDINGS,BUILDING_CLASS_CATEGORY_09 COOPS - WALKUP APARTMENTS,BUILDING_CLASS_CATEGORY_04 TAX CLASS 1 CONDOS,BUILDING_CLASS_CATEGORY_17 CONDO COOPS,BUILDING_CLASS_CATEGORY_29 COMMERCIAL GARAGES,BUILDING_CLASS_CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL,BUILDING_CLASS_CATEGORY_12 CONDOS - WALKUP APARTMENTS,BUILDING_CLASS_CATEGORY_27 FACTORIES,BUILDING_CLASS_CATEGORY_22 STORE BUILDINGS,BUILDING_CLASS_CATEGORY_31 COMMERCIAL VACANT LAND,BUILDING_CLASS_CATEGORY_45 CONDO HOTELS,BUILDING_CLASS_CATEGORY_32 HOSPITAL AND HEALTH FACILITIES,BUILDING_CLASS_CATEGORY_14 RENTALS - 4-10 UNIT,BUILDING_CLASS_CATEGORY_06 TAX CLASS 1 - OTHER,BUILDING_CLASS_CATEGORY_46 CONDO STORE BUILDINGS,BUILDING_CLASS_CATEGORY_21 OFFICE BUILDINGS,BUILDING_CLASS_CATEGORY_37 RELIGIOUS FACILITIES,BUILDING_CLASS_CATEGORY_33 EDUCATIONAL FACILITIES,BUILDING_CLASS_CATEGORY_30 WAREHOUSES,BUILDING_CLASS_CATEGORY_11A CONDO-RENTALS,BUILDING_CLASS_CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES,BUILDING_CLASS_CATEGORY_34 THEATRES,BUILDING_CLASS_CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT,BUILDING_CLASS_CATEGORY_26 OTHER HOTELS,BUILDING_CLASS_CATEGORY_28 COMMERCIAL CONDOS,BUILDING_CLASS_CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES,BUILDING_CLASS_CATEGORY_39 TRANSPORTATION FACILITIES,BUILDING_CLASS_CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC,BUILDING_CLASS_CATEGORY_23 LOFT BUILDINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_2,TAX_CLASS_AT_PRESENT_4,TAX_CLASS_AT_PRESENT_2A,TAX_CLASS_AT_PRESENT_1B,TAX_CLASS_AT_PRESENT_1A,TAX_CLASS_AT_PRESENT_2C,TAX_CLASS_AT_PRESENT_1C,TAX_CLASS_AT_PRESENT_2B,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE
44,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,5495,801,11230.0,1.0,0.0,1.0,1325.0,1930.0,1
61,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,7918,72,11427.0,1.0,0.0,1.0,2001.0,1940.0,1
66,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1347,18,10022.0,0.0,0.0,0.0,0.0,1960.0,2
67,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1491,62,10075.0,0.0,0.0,0.0,0.0,1925.0,2
71,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1171,2200,10069.0,1.0,0.0,1.0,827.0,2004.0,2


In [175]:
x_train.shape

(10442, 75)

In [176]:
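# Only the last expression in a cell is displayed, so the output below is for x_test.
x_train.isna().sum().sort_values(ascending=False)[:10]
x_test.isna().sum().sort_values(ascending=False)[:10]
# YEAR_BUILT has 5 NaN values in x_train and 17 NaN values in x_test.
# They must be filled (or dropped) before SelectKBest, because f_regression does not accept NaN.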

YEAR_BUILT                                                  17
BUILDING_CLASS_CATEGORY_43 CONDO OFFICE BUILDINGS            0
BUILDING_CLASS_CATEGORY_03 THREE FAMILY DWELLINGS            0
BUILDING_CLASS_CATEGORY_41 TAX CLASS 4 - OTHER               0
BUILDING_CLASS_CATEGORY_08 RENTALS - ELEVATOR APARTMENTS     0
BUILDING_CLASS_CATEGORY_44 CONDO PARKING                     0
BUILDING_CLASS_CATEGORY_07 RENTALS - WALKUP APARTMENTS       0
BUILDING_CLASS_CATEGORY_47 CONDO NON-BUSINESS STORAGE        0
BUILDING_CLASS_CATEGORY_05 TAX CLASS 1 VACANT LAND           0
BUILDING_CLASS_CATEGORY_09 COOPS - WALKUP APARTMENTS         0
dtype: int64

In [0]:
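# Replace NaN values with the training-set mean of the column.
# The same training mean is used for the test set so no test information leaks in.
year_built_mean = x_train['YEAR_BUILT'].mean()
x_train['YEAR_BUILT'] = x_train['YEAR_BUILT'].fillna(value=year_built_mean)
x_test['YEAR_BUILT'] = x_test['YEAR_BUILT'].fillna(value=year_built_mean)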
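When imputing, the fill value is usually learned from the training rows only and then reused for the test rows. A minimal sketch of that pattern with scikit-learn's `SimpleImputer` follows; the data and the `train_demo`, `test_demo`, and `imputer` names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values: one missing year in each split.
train_demo = pd.DataFrame({"YEAR_BUILT": [1920.0, 1940.0, np.nan, 1960.0]})
test_demo = pd.DataFrame({"YEAR_BUILT": [np.nan, 2004.0]})

# fit_transform learns the mean from the training rows;
# transform reuses that same mean for the test rows.
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_demo)
test_filled = imputer.transform(test_demo)

print(imputer.statistics_)  # mean learned from the training rows
print(test_filled)
```

An imputer object like this can also be placed inside a scikit-learn pipeline, so the same fit/transform discipline is applied automatically.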

In [0]:
# Do feature selection with SelectKBest

from sklearn.feature_selection import f_regression, SelectKBest

# There are 75 features, select 15 to begin with. 
selector = SelectKBest(score_func=f_regression, k=15)

# .fit_transform training data, then .transform testing data
x_train_selected = selector.fit_transform(x_train, y_train)
x_test_selected = selector.transform(x_test)
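To see what `SelectKBest` with `f_regression` is ranking, it can help to run it on synthetic data where the answer is known. This is a sketch on made-up data (the `X_demo`, `y_demo`, and `selector_demo` names are hypothetical): only the first column drives the target, so it should be the one kept when `k=1`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Only column 0 drives the target; columns 1 and 2 are noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

selector_demo = SelectKBest(score_func=f_regression, k=1)
X_selected = selector_demo.fit_transform(X_demo, y_demo)

print(selector_demo.scores_)        # univariate F-statistic per column
print(selector_demo.get_support())  # boolean mask of kept columns
```

Because each column is scored on its own, strongly correlated dummy columns can all rank highly together, which is worth keeping in mind when reading the selected-feature list.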

In [179]:
# Which features were selected? This is for inspecting individual features.
names = x_train.columns
selected_features = selector.get_support()
selected_features_names = names[selected_features]
unselected_features_names = names[~selected_features]

print('Selected Features:')
for name in selected_features_names:
  print(name)

print('\nNot Selected Features:')
for name in unselected_features_names:
  print(name)

Selected Features:
BOROUGH_3
BOROUGH_4
BOROUGH_1
BOROUGH_2
BOROUGH_5
NEIGHBORHOOD_OTHER
BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS
BUILDING_CLASS_CATEGORY_10 COOPS - ELEVATOR APARTMENTS
BUILDING_CLASS_CATEGORY_13 CONDOS - ELEVATOR APARTMENTS
BUILDING_CLASS_CATEGORY_03 THREE FAMILY DWELLINGS
BUILDING_CLASS_CATEGORY_07 RENTALS - WALKUP APARTMENTS
BUILDING_CLASS_CATEGORY_09 COOPS - WALKUP APARTMENTS
TAX_CLASS_AT_PRESENT_2
TAX_CLASS_AT_PRESENT_2A
BLOCK

Not Selected Features:
NEIGHBORHOOD_UPPER EAST SIDE (79-96)
NEIGHBORHOOD_UPPER WEST SIDE (59-79)
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_EAST NEW YORK
NEIGHBORHOOD_ASTORIA
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_GRAMERCY
NEIGHBORHOOD_UPPER EAST SIDE (59-79)
NEIGHBORHOOD_BOROUGH PARK
NEIGHBORHOOD_FOREST HILLS
BUILDING_CLASS_CATEGORY_02 TWO FAMILY DWELLINGS
BUILDING_CLASS_CATEGORY_41 TAX CLASS 4 - OTHER
BUILDING_CLASS_CATEGORY_08 RENTALS - ELEVATOR APARTMENTS
BUILDING_CLASS_CATEGORY_44 CONDO PARKING
BUILDING_CLASS_CATEGORY_47 CONDO

In [180]:
# Search for the "right" number of features (k): fit a model for each k and
# compare mean absolute errors. The k with the lowest error is the best choice
# for this model. Note that the errors here are measured on the test set.

from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

for k in range(1, len(x_train.columns)+1):
  print(f'{k} features')

  # do SelectKBest fit and transform
  selector = SelectKBest(score_func=f_regression, k=k)
  x_train_selected = selector.fit_transform(x_train, y_train)
  x_test_selected = selector.transform(x_test)

  # Do linear regression on the selected dataframes
  model = LinearRegression()
  model.fit(x_train_selected, y_train)

  # then predict the y which is price using the selected x_test features
  y_pred = model.predict(x_test_selected)
  error = mean_absolute_error(y_test, y_pred)
  print(f"Test Mean Absolute Error: ${error:,.0f}")

1 features
Test Mean Absolute Error: $312,606
2 features
Test Mean Absolute Error: $310,840
3 features
Test Mean Absolute Error: $282,638
4 features
Test Mean Absolute Error: $282,638
5 features
Test Mean Absolute Error: $276,889
6 features
Test Mean Absolute Error: $272,028
7 features
Test Mean Absolute Error: $269,495
8 features
Test Mean Absolute Error: $266,395
9 features
Test Mean Absolute Error: $265,021
10 features
Test Mean Absolute Error: $265,021
11 features
Test Mean Absolute Error: $251,900
12 features
Test Mean Absolute Error: $251,857
13 features
Test Mean Absolute Error: $251,370
14 features
Test Mean Absolute Error: $251,394
15 features
Test Mean Absolute Error: $246,984
16 features
Test Mean Absolute Error: $244,485
17 features
Test Mean Absolute Error: $244,428
18 features
Test Mean Absolute Error: $244,408
19 features
Test Mean Absolute Error: $243,460
20 features
Test Mean Absolute Error: $243,360
21 features
Test Mean Absolute Error: $243,336
22 features
Test Mean 
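The loop above compares values of k by their error on the test set. One common alternative is to choose k by cross-validation on the training data only, which leaves the test set untouched for a final check. The sketch below shows that approach on synthetic data with a scikit-learn pipeline; the data, the `Ridge(alpha=1.0)` choice, and all variable names are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 6))
# Two informative columns, four noise columns.
y_demo = 5.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Scaling, selection, and the model are refit inside every CV fold.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression),
    Ridge(alpha=1.0),
)

# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    pipeline,
    param_grid={"selectkbest__k": [1, 2, 3, 4, 5, 6]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["selectkbest__k"]
print(f"Best k: {best_k}, CV MAE: {-search.best_score_:.3f}")
```

Keeping the selector inside the pipeline means the feature scores are computed only from each fold's training portion.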

In [181]:
# feature scaling and ridge regression along with its error values. 

'''Ridge Regression is a technique for analyzing multiple regression data 
that suffer from multicollinearity. When multicollinearity occurs, 
least squares estimates are unbiased, but their variances are large 
so they may be far from the true value.'''

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from IPython.display import display, HTML
from ipywidgets import interact
import matplotlib.pyplot as plt

# Try a range of alpha values for ridge regression.

for alpha in [10**1, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7, 10**8, 10**9]:

  # First standardize the data. Ridge penalizes all coefficients equally, so
  # features should be on a comparable scale before fitting.
  # Older scikit-learn versions offered Ridge(normalize=True), but that
  # parameter has since been removed, so scale explicitly with StandardScaler.
  # Standardization also improves the conditioning of the problem.

  scaler = StandardScaler()
  x_train_scaled = scaler.fit_transform(x_train)
  x_test_scaled = scaler.transform(x_test)

  # fitting the ridge regression model. 
  display(HTML(f'Ridge Regression, with alpha={alpha}'))
  model = Ridge(alpha=alpha)
  model.fit(x_train_scaled, y_train)

  # get the mean absolute error of the model for training data
  y_pred = model.predict(x_train_scaled)
  error = mean_absolute_error(y_train, y_pred)
  display(HTML(f'Train Mean Absolute Error: ${error:,.0f}'))

  # get the mean absolute error of the model for test data
  y_pred = model.predict(x_test_scaled)
  error = mean_absolute_error(y_test, y_pred)
  display(HTML(f'Test Mean Absolute Error: ${error:,.0f}'))

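  # Plot the coefficients from the ridge regression model.
  # Author's note: this plot did not render as expected. One possible cause is
  # the fixed x-limits below, which may clip coefficients outside (-400, 1000).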
  coefficients = pd.Series(model.coef_, x_train.columns)
  plt.figure(figsize=(16,8))
  coefficients.sort_values().plot.barh(color='red')
  plt.xlim(-400, 1000)
  plt.show()
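The docstring in the cell above describes ridge regression as a remedy for multicollinearity. A minimal NumPy sketch of the textbook closed-form solution, w = (XᵀX + αI)⁻¹Xᵀy on centered data, shows how the coefficient norm shrinks as alpha grows. The helper `ridge_coefficients` and the synthetic data are hypothetical, and this is not how scikit-learn's `Ridge` is implemented internally.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for centered X and y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Two nearly collinear columns, as in the multicollinearity case.
X_demo = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=100)])
y_demo = x1 + rng.normal(scale=0.1, size=100)

# Center so the intercept can be left out of the penalty.
X_centered = X_demo - X_demo.mean(axis=0)
y_centered = y_demo - y_demo.mean()

norms = {}
for alpha in [0.0, 1.0, 100.0, 10_000.0]:
    w = ridge_coefficients(X_centered, y_centered, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(f"alpha={alpha:>8}: coefficients={np.round(w, 3)}, norm={norms[alpha]:.3f}")
```

With alpha at zero this reduces to ordinary least squares, where the two collinear columns can receive large offsetting coefficients; increasing alpha pulls them toward smaller, more stable values.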

















