<a href="https://colab.research.google.com/github/abdishifa234/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [97]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [98]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

In [99]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [100]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

###Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.

In [101]:
#Lets select this condition ('BUILDING_CLASS_CATEGORY' == '01 ONE FAMILY DWELLINGS')
df = df[df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS']
df.shape

(5061, 21)

In [102]:
#Lets select sale price is more than 100 thousand and less than 2 million.
df = df[(df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000)]
df.shape

(3151, 21)

In [103]:
#Check the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3151 entries, 44 to 23035
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   BOROUGH                         3151 non-null   object 
 1   NEIGHBORHOOD                    3151 non-null   object 
 2   BUILDING_CLASS_CATEGORY         3151 non-null   object 
 3   TAX_CLASS_AT_PRESENT            3151 non-null   object 
 4   BLOCK                           3151 non-null   int64  
 5   LOT                             3151 non-null   int64  
 6   EASE-MENT                       0 non-null      float64
 7   BUILDING_CLASS_AT_PRESENT       3151 non-null   object 
 8   ADDRESS                         3151 non-null   object 
 9   APARTMENT_NUMBER                1 non-null      object 
 10  ZIP_CODE                        3151 non-null   float64
 11  RESIDENTIAL_UNITS               3151 non-null   float64
 12  COMMERCIAL_UNITS                

In [104]:
#check for numeric column
df.describe()

Unnamed: 0,BLOCK,LOT,EASE-MENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
count,3151.0,3151.0,0.0,3151.0,3151.0,3151.0,3151.0,3151.0,3151.0,3151.0,3151.0
mean,6917.976515,75.981593,,11027.219613,0.987623,0.015868,1.003491,1470.306887,1943.6947,1.0,628560.1
std,3963.326705,161.089514,,482.875284,0.113414,0.127499,0.171789,586.3392,26.676786,0.0,292990.4
min,21.0,1.0,,10030.0,0.0,0.0,0.0,0.0,1890.0,1.0,104000.0
25%,4016.0,21.0,,10461.0,1.0,0.0,1.0,1144.0,1925.0,1.0,447500.0
50%,6301.0,42.0,,11235.0,1.0,0.0,1.0,1360.0,1938.0,1.0,568000.0
75%,10208.5,69.0,,11413.0,1.0,0.0,1.0,1683.0,1955.0,1.0,760000.0
max,16350.0,2720.0,,11697.0,2.0,2.0,3.0,7875.0,2018.0,1.0,1955000.0


In [105]:
#check for categorical column
df.describe(exclude='number')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,3151,3151,3151,3151,3151,3151,1,3151,3151,3151
unique,5,6,1,2,13,3135,1,1035,11,91
top,4,OTHER,01 ONE FAMILY DWELLINGS,1,A1,46-12 30TH ROAD,RP.,4000,A1,01/31/2019
freq,1580,2990,3151,3111,1185,2,1,289,1186,78


###Do train/test split. Use data from January to March 2019 to train. Use data from April 2019 to test.

In [106]:
df['SALE_DATE'].dtype

dtype('O')

In [107]:
#lets change the data types of the columns "SALE_DATE" to date time object
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'], infer_datetime_format=True)
#check the data types again
df['SALE_DATE'].dtype

dtype('<M8[ns]')

In [108]:
#ok lets do train test split by using cutoff time '04/01/2019'
cutoff = pd.to_datetime('04/01/2019')
train = df[df['SALE_DATE'] < cutoff]
test = df[df['SALE_DATE'] >= cutoff]

In [109]:
#ok lets see the shape of the train and test set
print(train.shape, test.shape)

(2507, 21) (644, 21)


###Do one-hot encoding of categorical features.

In [110]:
df.nunique()

BOROUGH                              5
NEIGHBORHOOD                         6
BUILDING_CLASS_CATEGORY              1
TAX_CLASS_AT_PRESENT                 2
BLOCK                             2496
LOT                                332
EASE-MENT                            0
BUILDING_CLASS_AT_PRESENT           13
ADDRESS                           3135
APARTMENT_NUMBER                     1
ZIP_CODE                           125
RESIDENTIAL_UNITS                    3
COMMERCIAL_UNITS                     3
TOTAL_UNITS                          4
LAND_SQUARE_FEET                  1035
GROSS_SQUARE_FEET                 1050
YEAR_BUILT                          89
TAX_CLASS_AT_TIME_OF_SALE            1
BUILDING_CLASS_AT_TIME_OF_SALE      11
SALE_PRICE                        1000
SALE_DATE                           91
dtype: int64

###Let's see at the relationship between NEIHBORHOOD and SALE PRICE:

In [111]:
df['NEIGHBORHOOD'].value_counts()

OTHER                 2990
FLUSHING-NORTH          97
FOREST HILLS            22
BOROUGH PARK            19
ASTORIA                 14
BEDFORD STUYVESANT       9
Name: NEIGHBORHOOD, dtype: int64

In [112]:
train.groupby('NEIGHBORHOOD')['SALE_PRICE'].mean().sort_values()

NEIGHBORHOOD
OTHER                 6.056645e+05
BEDFORD STUYVESANT    6.215972e+05
FLUSHING-NORTH        8.689417e+05
ASTORIA               1.001955e+06
BOROUGH PARK          1.008917e+06
FOREST HILLS          1.210753e+06
Name: SALE_PRICE, dtype: float64

In [113]:
#filtering a high cardinality features and determing our targets
high_card_features = ['BUILDING_CLASS_CATEGORY', 'BLOCK', 'LOT', 'EASE-MENT', 'BUILDING_CLASS_AT_PRESENT', 
                      'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE', 'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 
                      'TOTAL_UNITS', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
                      'BUILDING_CLASS_AT_TIME_OF_SALE', 'SALE_PRICE', 'SALE_DATE']
                      
target = 'SALE_PRICE'

#lets drop a high cardinalilty categorical features and our targets
features = train.columns.drop([target] + high_card_features)

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

In [114]:
#check the shape of train and test set
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)

(2507, 4) (2507,)
(644, 4) (644,)


In [115]:
#here's what X_train looks like before encoding
print(X_train.shape)
X_train.head()

(2507, 4)


Unnamed: 0,BOROUGH,NEIGHBORHOOD,TAX_CLASS_AT_PRESENT,TAX_CLASS_AT_TIME_OF_SALE
44,3,OTHER,1,1
61,4,OTHER,1,1
78,2,OTHER,1,1
108,3,OTHER,1,1
111,3,OTHER,1,1


In [116]:
#import the class/module
import category_encoders as ce

#Instantiate the class 
encoder = ce.OneHotEncoder(use_cat_names=True)

#fitting the transformer to the data
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [117]:
#here's what it looks like after encoding
print(X_train.shape)
X_train.head()

(2507, 14)


Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,TAX_CLASS_AT_TIME_OF_SALE
44,1,0,0,0,0,1,0,0,0,0,0,1,0,1
61,0,1,0,0,0,1,0,0,0,0,0,1,0,1
78,0,0,1,0,0,1,0,0,0,0,0,1,0,1
108,1,0,0,0,0,1,0,0,0,0,0,1,0,1
111,1,0,0,0,0,1,0,0,0,0,0,1,0,1


###Do feature selection with `SelectKBest`.

In [128]:
#importing the tools here
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for i in range(1, len(X_train.columns) + 1):
  #instantiate the class
  selector = SelectKBest(score_func=f_regression, k=i)
  model = LinearRegression()

  #fiiting and transforing the data
  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.transform(X_test)

  #fitting the model and predicting 
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)

  #validating/evaluting the model
  mae = mean_absolute_error(y_test, y_pred)
  print(f'Test MAE with {i} features: {mae:.0f}')

warnings.filterwarnings(action='ignore', category=RuntimeWarning, module='sklearn')
warnings.filterwarnings(action='ignore', category=RuntimeWarning, module='scipy')

Test MAE with 1 features: 202972
Test MAE with 2 features: 203650
Test MAE with 3 features: 203738
Test MAE with 4 features: 203037
Test MAE with 5 features: 201810
Test MAE with 6 features: 202176
Test MAE with 7 features: 201712
Test MAE with 8 features: 201708
Test MAE with 9 features: 199215
Test MAE with 10 features: 199598
Test MAE with 11 features: 200175
Test MAE with 12 features: 199983
Test MAE with 13 features: 200660
Test MAE with 14 features: 200820


In [141]:
#to find the features selected and unselected:
k = 10
selector = SelectKBest(score_func=f_regression, k=k)

X_train_selected = selector.fit_transform(X_train_scaled, y_train)

all_features = X_train.columns
selected_mask = selector.get_support()
selected_names = all_features[selected_mask]
unselected_names = all_features[~selected_mask]

print('Features selected:')
for name in selected_names:
  print(name)
  
print('\nFeatures not selected:')
for name in unselected_names:
  print(name)

Features selected:
BOROUGH_3
BOROUGH_4
BOROUGH_2
BOROUGH_5
NEIGHBORHOOD_OTHER
NEIGHBORHOOD_FLUSHING-NORTH
NEIGHBORHOOD_FOREST HILLS
NEIGHBORHOOD_BOROUGH PARK
TAX_CLASS_AT_PRESENT_1
TAX_CLASS_AT_PRESENT_1D

Features not selected:
BOROUGH_1
NEIGHBORHOOD_BEDFORD STUYVESANT
NEIGHBORHOOD_ASTORIA
TAX_CLASS_AT_TIME_OF_SALE


In [129]:
#lets see the shape
X_train_selected.shape, X_test_selected.shape

((2507, 14), (644, 14))

###Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)

In [139]:
# feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

y_train_scaled = scale(y_train)
y_test_scaled = scale(y_test)

In [140]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_selected, columns=X_test.columns)
y_train_scaled = pd.DataFrame(y_train_scaled)
y_test_scaled = pd.DataFrame(y_test_scaled)
X_train_scaled

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,TAX_CLASS_AT_TIME_OF_SALE
0,2.301955,-0.961260,-0.326869,-0.599005,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
1,-0.434413,1.040301,-0.326869,-0.599005,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
2,-0.434413,-0.961260,3.059331,-0.599005,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
3,2.301955,-0.961260,-0.326869,-0.599005,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
4,2.301955,-0.961260,-0.326869,-0.599005,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2502,-0.434413,-0.961260,-0.326869,1.669434,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
2503,-0.434413,-0.961260,-0.326869,1.669434,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
2504,-0.434413,-0.961260,-0.326869,1.669434,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0
2505,-0.434413,-0.961260,-0.326869,1.669434,-0.019976,0.229078,-0.178009,-0.05658,-0.082628,-0.069351,-0.066386,0.111894,-0.111894,0.0


In [134]:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

alpha_mse = {}

for alpha in range(0, 200, 1):
  ridge_reg_split = Ridge(alpha=alpha).fit(X_train_scaled, y_train_scaled)
  mse = mean_squared_error(y_test_scaled, ridge_reg_split.predict(X_test_scaled))
  alpha_mse[alpha] = mse

print(sorted(alpha_mse.items(), reverse=True))

[(199, 0.9474529988857704), (198, 0.9474421667712564), (197, 0.9474313304323786), (196, 0.9474204898673633), (195, 0.9474096450744367), (194, 0.9473987960518263), (193, 0.9473879427977608), (192, 0.9473770853104693), (191, 0.9473662235881823), (190, 0.9473553576291309), (189, 0.9473444874315473), (188, 0.9473336129936644), (187, 0.9473227343137169), (186, 0.947311851389939), (185, 0.9473009642205679), (184, 0.9472900728038398), (183, 0.9472791771379934), (182, 0.9472682772212672), (181, 0.9472573730519019), (180, 0.9472464646281379), (179, 0.9472355519482182), (178, 0.9472246350103855), (177, 0.9472137138128842), (176, 0.9472027883539594), (175, 0.9471918586318581), (174, 0.9471809246448275), (173, 0.9471699863911157), (172, 0.9471590438689729), (171, 0.9471480970766496), (170, 0.9471371460123978), (169, 0.9471261906744705), (168, 0.947115231061122), (167, 0.9471042671706074), (166, 0.9470932990011829), (165, 0.9470823265511061), (164, 0.947071349818636), (163, 0.9470603688020327), (16