<a href="https://colab.research.google.com/github/Logan-Stark/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

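# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False makes '$' a literal character instead of a regex end-of-string anchor
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)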
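
The cleanup step above depends on how `str.replace` interprets its pattern. Depending on the pandas version, the pattern may be treated as a regular expression, in which case `'$'` matches the end of the string rather than a literal dollar sign. The sketch below uses made-up price strings (not rows from the dataset) and a hypothetical helper, `clean_price`, to show literal matching with `regex=False`:

```python
import pandas as pd

# Hypothetical raw values, for illustration only (not taken from the dataset)
raw_prices = pd.Series(['$550,000', '$1,250,000', '$0'])

def clean_price(prices):
    """Strip '$', '-' and ',' as literal characters, then convert to int."""
    for symbol in ['$', '-', ',']:
        prices = prices.str.replace(symbol, '', regex=False)
    return prices.astype(int)

cleaned = clean_price(raw_prices)
print(cleaned.tolist())
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions.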

In [None]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [None]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
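
To see what this cardinality reduction does, here is a small sketch on a toy Series. The labels and the choice of the top 2 (rather than the top 10) are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical neighborhood labels, for illustration only
toy_neighborhoods = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])

# Keep the 2 most frequent labels
top2 = toy_neighborhoods.value_counts()[:2].index

# Where the label is NOT in the top 2, replace it with 'OTHER'
reduced = toy_neighborhoods.where(toy_neighborhoods.isin(top2), 'OTHER')
print(reduced.value_counts())
```

Here `Series.where` keeps values where the condition is true and substitutes `'OTHER'` elsewhere, which has the same effect as the `df.loc[...] = 'OTHER'` assignment used in the notebook.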

# Data Exploration
- Cleaning and analyzing our data

In [None]:
# Taking a first look at our data with the .head() method
# APARTMENT_NUMBER and EASE-MENT appear to contain many NaNs
# SALE_DATE needs to be converted to datetime format

df.head()


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019


In [None]:
# Confirming above suspicions

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23040 entries, 0 to 23039
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   BOROUGH                         23040 non-null  object 
 1   NEIGHBORHOOD                    23040 non-null  object 
 2   BUILDING_CLASS_CATEGORY         23040 non-null  object 
 3   TAX_CLASS_AT_PRESENT            23039 non-null  object 
 4   BLOCK                           23040 non-null  int64  
 5   LOT                             23040 non-null  int64  
 6   EASE-MENT                       0 non-null      float64
 7   BUILDING_CLASS_AT_PRESENT       23039 non-null  object 
 8   ADDRESS                         23040 non-null  object 
 9   APARTMENT_NUMBER                5201 non-null   object 
 10  ZIP_CODE                        23039 non-null  float64
 11  RESIDENTIAL_UNITS               23039 non-null  float64
 12  COMMERCIAL_UNITS                

#### Cleaning our dataset
- Dropping EASE-MENT and APARTMENT_NUMBER columns
- Dropping NaN values with df.dropna
- Changing SALE_DATE to datetime format

In [None]:
# Dropping EASE-MENT column from our dataframe

df = df.drop('EASE-MENT', axis= 1)

In [None]:
# Dropping APARTMENT_NUMBER column from our dataframe

df = df.drop('APARTMENT_NUMBER', axis= 1)

In [None]:
# Dropping NaN values with df.dropna

df=df.dropna()

In [None]:
# Changing SALE_DATE to datetime format

df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)
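
The `SALE_DATE` strings in the `.head()` output look like `01/01/2019`. Assuming they are month/day/year (an assumption, since the first rows are ambiguous), an explicit `format` removes the guesswork from parsing. A minimal sketch on made-up dates:

```python
import pandas as pd

# Hypothetical date strings in the assumed month/day/year layout
toy_dates = pd.Series(['01/01/2019', '04/15/2019'])

# An explicit format avoids any ambiguity between month and day
parsed = pd.to_datetime(toy_dates, format='%m/%d/%Y')
print(parsed.dt.month.tolist(), parsed.dt.day.tolist())
```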

# Creating new data subset
- Filtering our dataframe to only include data where BUILDING_CLASS_CATEGORY is equal to 01 ONE FAMILY DWELLINGS
- Filtering our dataframe to only include data where SALE_PRICE is greater than 100000 and less than 2000000
- Use our data subset to make training data using data from January to March of 2019
- Use our data subset to make testing data using data from April of 2019

In [None]:
# Filtering our dataframe to only include data where BUILDING_CLASS_CATEGORY 
# is equal to 01 ONE FAMILY DWELLINGS

datasubset = df[df['BUILDING_CLASS_CATEGORY'] =='01 ONE FAMILY DWELLINGS']

In [None]:
# Checking our work

datasubset.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,BUILDING_CLASS_AT_PRESENT,ADDRESS,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
7,2,OTHER,01 ONE FAMILY DWELLINGS,1,4090,37,A1,1193 SACKET AVENUE,10461.0,1.0,0.0,1.0,3404,1328.0,1925.0,1,A1,0,2019-01-01
8,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,18,A5,1215 VAN NEST AVENUE,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,2019-01-01
9,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,20,A5,1211 VAN NEST AVENUE,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,2019-01-01
42,3,OTHER,01 ONE FAMILY DWELLINGS,1,6809,54,A1,2601 AVENUE R,11229.0,1.0,0.0,1.0,3333,1262.0,1925.0,1,A1,0,2019-01-01
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,A9,4832 BAY PARKWAY,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-01-01


In [None]:
# Checking our work

datasubset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5061 entries, 7 to 23035
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   BOROUGH                         5061 non-null   object        
 1   NEIGHBORHOOD                    5061 non-null   object        
 2   BUILDING_CLASS_CATEGORY         5061 non-null   object        
 3   TAX_CLASS_AT_PRESENT            5061 non-null   object        
 4   BLOCK                           5061 non-null   int64         
 5   LOT                             5061 non-null   int64         
 6   BUILDING_CLASS_AT_PRESENT       5061 non-null   object        
 7   ADDRESS                         5061 non-null   object        
 8   ZIP_CODE                        5061 non-null   float64       
 9   RESIDENTIAL_UNITS               5061 non-null   float64       
 10  COMMERCIAL_UNITS                5061 non-null   float64       
 11  TOT

In [None]:
# Filtering our dataframe to only include data where SALE_PRICE
# is greater than 100000 and less than 2000000

datasubset = datasubset[datasubset['SALE_PRICE'] >= 100000]

In [None]:
# same as above

datasubset = datasubset[datasubset['SALE_PRICE'] < 2000000]
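
The category and price filters can also be written as one boolean mask. The sketch below uses a made-up four-row frame, not the real data. It keeps the notebook's `>=` lower bound; a strict reading of "more than 100 thousand" would use `>` instead.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy_sales = pd.DataFrame({
    'BUILDING_CLASS_CATEGORY': [
        '01 ONE FAMILY DWELLINGS',
        '01 ONE FAMILY DWELLINGS',
        '21 OFFICE BUILDINGS',
        '01 ONE FAMILY DWELLINGS',
    ],
    'SALE_PRICE': [0, 550000, 900000, 2500000],
})

# Combine all three conditions with & (each wrapped in parentheses)
keep = (
    (toy_sales['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')
    & (toy_sales['SALE_PRICE'] >= 100000)
    & (toy_sales['SALE_PRICE'] < 2000000)
)
toy_subset = toy_sales[keep]
print(toy_subset)
```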

In [None]:
# Checking our work

datasubset['SALE_PRICE'].value_counts().sort_values()

936000     1
396550     1
544000     1
539900     1
417000     1
          ..
650000    36
450000    40
525000    40
550000    43
500000    48
Name: SALE_PRICE, Length: 1001, dtype: int64

In [None]:
# Checking our work

df['SALE_PRICE'].value_counts().sort_values()

1138646       1
757504        1
317163        1
3870400       1
1912500       1
           ... 
650000      120
750000      121
800000      125
10          199
0          6905
Name: SALE_PRICE, Length: 3801, dtype: int64

In [None]:
# Defining the cutoff date used to split our data into train and test sets

cutoff = pd.to_datetime('2019-04-01')

In [None]:
# Use our data subset to make training data using data from January to March of 2019

train = datasubset[datasubset.SALE_DATE < cutoff]

In [None]:
# Use our data subset to make testing data using data from April of 2019

test  = datasubset[datasubset.SALE_DATE >= cutoff]
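
A quick check on a time-based split is that every training date falls before every test date and that no row is lost. A sketch on made-up rows, with `toy_` names so it does not overwrite the notebook's `train` and `test`:

```python
import pandas as pd

# Hypothetical sales, for illustration only
toy_sales = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(
        ['2019-01-15', '2019-03-31', '2019-04-01', '2019-04-30']
    ),
    'SALE_PRICE': [400000, 500000, 600000, 700000],
})

toy_cutoff = pd.to_datetime('2019-04-01')

# Rows before the cutoff go to train; rows on or after it go to test
toy_train = toy_sales[toy_sales['SALE_DATE'] < toy_cutoff]
toy_test = toy_sales[toy_sales['SALE_DATE'] >= toy_cutoff]

print(len(toy_train), len(toy_test))
```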

# Perform OHE on a feature from our dataframe
- Find a good candidate for OHE using .describe() method
- Import and use OneHotEncoder from sklearn on BOROUGH

In [None]:
# Find a good candidate for OHE using .describe() method
# We will use BOROUGH as our feature 

train.describe(exclude='number')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,2515.0,2515,2515,2515.0,2515,2515,2515.0,2515,2515
unique,5.0,7,1,2.0,13,2505,888.0,11,68
top,4.0,OTHER,01 ONE FAMILY DWELLINGS,1.0,A1,294 FREEBORN STREET,4000.0,A1,2019-01-31 00:00:00
freq,1209.0,2366,2515,2484.0,921,2,235.0,921,78
first,,,,,,,,,2019-01-01 00:00:00
last,,,,,,,,,2019-03-30 00:00:00


In [None]:
# Import our tool
from sklearn.preprocessing import OneHotEncoder

# Instantiate our class
ohe = OneHotEncoder()

# Fit our tool to our data 
ohe.fit(train[['BOROUGH']])

# Transforming our data and putting into an array
train_trans = ohe.transform(train[['BOROUGH']]).toarray()

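# Fitting and transforming can be done in one step:
# train_trans = ohe.fit_transform(train[['BOROUGH']]).toarray()

# Make sure not to refit the encoder on the test set; only call .transform()
# there, so the test columns match the ones learned from the training data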

In [None]:
# Checking work

train_trans

array([[0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.]])
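
The notebook encodes only the training set. The sketch below shows one way the fitted encoder might be applied to a test set, using made-up borough codes. The `handle_unknown='ignore'` option is an addition not used in the notebook: it encodes categories unseen during fitting as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical borough codes, for illustration only
toy_train = pd.DataFrame({'BOROUGH': ['1', '2', '3', '2']})
toy_test = pd.DataFrame({'BOROUGH': ['3', '5']})  # '5' never appears in toy_train

# Fit on the training data only, then reuse the fitted encoder on the test data
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(toy_train[['BOROUGH']]).toarray()
test_encoded = encoder.transform(toy_test[['BOROUGH']]).toarray()

print(encoder.categories_)  # categories learned from toy_train
print(test_encoded)
```

Because the encoder is fitted once, the train and test arrays have the same number of columns in the same order.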

# Perform SelectKBest to find the ideal features for Ridge Regression
- Construct X_train and X_test
- Construct y_train and y_test
- Perform SelectKBest using our newly constructed variables

In [None]:
# Construct X_train

X_train = train.drop('SALE_PRICE', axis= 1)

In [None]:
# Make sure our X_train only includes numerical values

X_train = X_train.select_dtypes(include='number')

In [None]:
# Construct X_test
X_test = test.drop('SALE_PRICE', axis= 1)

In [None]:
# Make sure our X_test only includes numerical values

X_test = X_test.select_dtypes(include='number')

In [None]:
# Construct y_train

y_train = train['SALE_PRICE']

In [None]:
# Construct y_test

y_test = test['SALE_PRICE']

In [None]:
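# Import our tools from sklearn
from sklearn.feature_selection import SelectKBest, f_regression

# Instantiate our selector. SALE_PRICE is a continuous target, so we score
# features with f_regression rather than the default f_classif
selector = SelectKBest(score_func=f_regression, k=9)

# Fit our selector on the training data with .fit_transform(), then apply it
# to the test data with .transform()
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)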


In [None]:
# Constructing a mask to use with our algorithm to 
# show which features are best to use

mask = selector.get_support()

In [None]:
# Using our Mask on X_train columns to show results

X_train.columns[mask]

Index(['BLOCK', 'LOT', 'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS',
       'TOTAL_UNITS', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE'],
      dtype='object')
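
With `k=9` and nine numeric columns, every feature is kept, so the selection above does not rank anything. The sketch below shows how per-feature scores can be inspected on synthetic data. The column names and data are made up, and `f_regression` is chosen here because the target is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: one informative column and two pure-noise columns
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'informative': rng.normal(size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
y_toy = 5 * X_toy['informative'] + rng.normal(scale=0.1, size=200)

# Keep only the single best-scoring feature
toy_selector = SelectKBest(score_func=f_regression, k=1)
toy_selector.fit(X_toy, y_toy)

# scores_ holds one F-statistic per input column
scores = pd.Series(toy_selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))

selected = list(X_toy.columns[toy_selector.get_support()])
print(selected)
```

Trying several values of `k` and comparing the resulting test error is one way to choose how many features to keep.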

# Perform a Ridge Regression using multiple features
- Use RidgeCV to find an appropriate alpha for our data.
- Import, Instantiate, and Fit our model to our data
- Use mean absolute error to check our model's error on the training and test sets

In [None]:
# Provide a range of alphas for RidgeCV to use

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

In [None]:
# Using RidgeCV to find the appropriate alpha
# Importing our tool
from sklearn.linear_model import RidgeCV

# Instantiate our tool
ridge = RidgeCV(alphas=alphas, normalize=True)

# Fit tool to our data
ridge.fit(X_train, y_train)
ridge.alpha_

0.01
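
Recent scikit-learn releases have removed the `normalize` parameter from `Ridge` and `RidgeCV`. The assignment's alternative, feature scaling beforehand, can be sketched with a pipeline. This runs on synthetic data, and `StandardScaler` is not numerically identical to `normalize=True`, so the selected alpha and the errors may differ from the notebook's output.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 10000.0])
y_toy = X_toy @ np.array([2.0, 0.05, 0.0003]) + rng.normal(size=100)

toy_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# The scaler is fit on the training data only, then RidgeCV picks alpha
toy_pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=toy_alphas))
toy_pipeline.fit(X_toy, y_toy)

best_alpha = toy_pipeline.named_steps['ridgecv'].alpha_
print(best_alpha)
```

Calling `toy_pipeline.predict` on new data reuses the scaling learned during `fit`, matching the `fit_transform` on train and `transform` on test pattern described in the assignment.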

In [None]:
# Now that we know our appropriate alpha is 0.01, we can perform our Ridge Regression

# Import our tools
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge

# Instantiate our model
model = Ridge(alpha=0.01, normalize=True)

# Fit our regression model
model.fit(X_train, y_train)

# Check our error metric: training set
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print('Training MAE:', mae)

# Check our error metric: test set
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Test MAE:', mae)

Training MAE: 168032.17333562273
Test MAE: 168124.4017698454
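
Mean absolute error is the average of the absolute differences between actual and predicted prices, so it is in dollars. The sketch below computes it by hand on made-up numbers and compares it with scikit-learn's result. It also computes the MAE of always predicting the mean price, a baseline added here as one possible way to put a model's MAE in context.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted sale prices, for illustration only
y_true_toy = np.array([400000, 550000, 700000])
y_pred_toy = np.array([450000, 500000, 850000])

# MAE by hand: mean of |actual - predicted|
manual_mae = np.mean(np.abs(y_true_toy - y_pred_toy))
sklearn_mae = mean_absolute_error(y_true_toy, y_pred_toy)

# Baseline: always predict the mean of the actual prices
baseline_pred = np.full(len(y_true_toy), y_true_toy.mean())
baseline_mae = mean_absolute_error(y_true_toy, baseline_pred)

print(manual_mae, sklearn_mae, baseline_mae)
```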
