Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.
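
For the feature-scaling option in the checklist above, here is a minimal sketch on synthetic data (the two features and their coefficients are made up for illustration, not taken from the NYC dataset). It fits `StandardScaler` on the training rows only and reuses those statistics for the test rows. In recent scikit-learn releases `Ridge` no longer accepts `normalize=True`, so the scaler route is the one shown here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): two features on very different scales
X = np.column_stack([rng.normal(1500, 400, 200), rng.normal(3, 1, 200)])
y = 300 * X[:, 0] + 20000 * X[:, 1] + rng.normal(0, 10000, 200)

# Simple ordered split, standing in for the date-based split
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: {mae:,.0f}')
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so train and test features would no longer be on the same footing.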


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
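
Two of the stretch goals above, `RidgeCV` and pipelines, combine naturally. The sketch below is one possible arrangement, run on synthetic data from `make_regression` (an assumption made so the example is self-contained); the choice of `k=5` and the alpha grid are illustrative values, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real estate data
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test = X[:225], X[225:]
y_train, y_test = y[:225], y[225:]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Each step is fitted on the training data only when pipeline.fit is called
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
    RidgeCV(alphas=alphas),
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

# make_pipeline names each step after its lowercased class name
chosen_alpha = pipeline.named_steps['ridgecv'].alpha_
print('Chosen alpha:', chosen_alpha)
print('Test MAE:', mae)
```

Because scaling and selection live inside the pipeline, `predict` applies the already-fitted transformations to the test rows automatically.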

In [28]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [29]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols (matched as literal characters, not regex patterns), convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)

In [30]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [31]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
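
The same top-N idea can be wrapped in a small helper. This is a sketch on a toy Series; `keep_top_n` is a hypothetical name introduced for illustration, and the labels are made up.

```python
import pandas as pd

def keep_top_n(series, n, other_label='OTHER'):
    """Keep the n most frequent values; replace everything else with other_label."""
    top = series.value_counts().index[:n]
    # where() keeps values that satisfy the condition and replaces the rest
    return series.where(series.isin(top), other_label)

# Toy data: 'A' appears 3 times, 'B' twice, 'C' and 'D' once each
neighborhoods = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', 'D'])
reduced = keep_top_n(neighborhoods, n=2)
print(reduced.tolist())
```

Unlike the `df.loc[...] = 'OTHER'` assignment, `where` returns a new Series and leaves the original untouched.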

In [32]:
df.sample(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
21389,3,OTHER,15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1527,1006,,R1,"205 MACDOUGAL STREET, 3B",3B,11233.0,1.0,0.0,1.0,0,698.0,2016.0,2,R1,544764,04/17/2019
16172,5,OTHER,01 ONE FAMILY DWELLINGS,1,1158,72,,A1,36 HOUSMAN AVENUE,,10303.0,1.0,0.0,1.0,4995,1208.0,1899.0,1,A1,377500,03/21/2019
21818,3,OTHER,02 TWO FAMILY DWELLINGS,1,5909,25,,B1,448 72ND STREET,,11209.0,2.0,0.0,2.0,2000,2730.0,1901.0,1,B1,0,04/19/2019
3608,3,BOROUGH PARK,03 THREE FAMILY DWELLINGS,1,5684,36,,C0,1350 55TH STREET,,11219.0,3.0,0.0,3.0,5008,4192.0,1925.0,1,C0,0,01/20/2019
12615,2,OTHER,05 TAX CLASS 1 VACANT LAND,1B,5740,331,,V0,N/A NETHERLAND AVENUE,,0.0,0.0,0.0,0.0,6028,0.0,0.0,1,V0,0,03/05/2019


In [33]:
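# Subset the data: one-family dwellings sold for more than $100,000 and less than $2 million.
# .copy() avoids a SettingWithCopyWarning when SALE_DATE is reassigned later.

df = df[(df['SALE_PRICE'] > 100000) & (df['SALE_PRICE'] < 2000000) & 
        (df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS')].copy()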

df.sample(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
6333,4,OTHER,01 ONE FAMILY DWELLINGS,1,3686,40,,A1,71-62 71ST STREET,,11385.0,1.0,0.0,1.0,4000,1980.0,1920.0,1,A1,650000,01/31/2019
4610,4,OTHER,01 ONE FAMILY DWELLINGS,1D,16350,400,,A8,"195 REID AVENUE, 1582",,11697.0,0.0,0.0,0.0,0,0.0,1938.0,1,A8,450000,01/24/2019
3925,4,OTHER,01 ONE FAMILY DWELLINGS,1,4643,49,,A1,146-35 WILLETS POINT BLVD,,11357.0,1.0,0.0,1.0,4045,1440.0,1935.0,1,A1,1090000,01/22/2019
20956,4,OTHER,01 ONE FAMILY DWELLINGS,1,14181,19,,A1,159-30 102ND STREET,,11414.0,1.0,0.0,1.0,4000,1544.0,1930.0,1,A1,590000,04/15/2019
13584,4,OTHER,01 ONE FAMILY DWELLINGS,1,11726,31,,A1,130-11 LEFFERTS BOULEVARD,,11420.0,1.0,0.0,1.0,2320,1234.0,1925.0,1,A1,450000,03/08/2019


In [34]:
# Split into train (January to March 2019) and test (April 2019)

df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
train_cutoff = pd.to_datetime('03/31/2019')
train = df[df['SALE_DATE'] <= train_cutoff]
test = df[df['SALE_DATE'] > train_cutoff]
train.shape, test.shape

((2507, 21), (644, 21))

# One-Hot Encoding

In [35]:
# Find the categorical features with low cardinality

train.describe(exclude='number')

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,LAND_SQUARE_FEET,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_DATE
count,2507.0,2507,2507,2507.0,2507,2507,1,2507.0,2507,2507
unique,5.0,7,1,2.0,13,2497,1,887.0,11,68
top,4.0,OTHER,01 ONE FAMILY DWELLINGS,1.0,A1,294 FREEBORN STREET,RP.,4000.0,A1,2019-01-31 00:00:00
freq,1204.0,2360,2507,2476.0,919,2,1,234.0,919,78
first,,,,,,,,,,2019-01-01 00:00:00
last,,,,,,,,,,2019-03-30 00:00:00


In [36]:
# Choose the target and the features (dropping high-cardinality columns)

target = 'SALE_PRICE'
high_cardinality = ['ADDRESS', 'LAND_SQUARE_FEET', 'SALE_DATE']
features = train.columns.drop([target] + high_cardinality)

features

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'GROSS_SQUARE_FEET', 'YEAR_BUILT', 'TAX_CLASS_AT_TIME_OF_SALE',
       'BUILDING_CLASS_AT_TIME_OF_SALE'],
      dtype='object')

In [51]:
X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

In [55]:
# Apply one-hot encoding: fit on the training set, then transform the test set
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
X_train.sample(5)

Unnamed: 0,BOROUGH_3,BOROUGH_4,BOROUGH_2,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,APARTMENT_NUMBER_nan,APARTMENT_NUMBER_RP.,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0
1991,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,16350,400,,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,11697.0,0.0,0.0,0.0,0.0,1938.0,1,0,0,0,0,0,0,0,0,0,1,0
7689,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,12214,19,,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,11434.0,1.0,0.0,1.0,1091.0,1925.0,1,0,0,0,0,1,0,0,0,0,0,0
15666,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,12586,22,,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,11434.0,1.0,0.0,1.0,1169.0,1930.0,1,0,0,0,0,1,0,0,0,0,0,0
9528,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,1215,96,,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,10303.0,1.0,0.0,1.0,1200.0,1993.0,1,0,0,1,0,0,0,0,0,0,0,0
14165,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,12476,20,,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,11434.0,1.0,0.0,1.0,1600.0,1940.0,1,0,0,1,0,0,0,0,0,0,0,0
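
The cell above fits the encoder on the training rows and only transforms the test rows, so both frames end up with the same columns. The same idea can be sketched without `category_encoders`, using `pandas.get_dummies` plus `reindex` as a swapped-in technique. The toy frames below are made up for illustration.

```python
import pandas as pd

# Toy data: borough '1' appears in test but never in train
train = pd.DataFrame({'BOROUGH': ['3', '4', '4', '5']})
test = pd.DataFrame({'BOROUGH': ['4', '1']})

# Encode the training rows; their columns define the feature set
train_encoded = pd.get_dummies(train, columns=['BOROUGH'], dtype=int)

# Encode the test rows, then align to the training columns:
# unseen categories are dropped, missing ones are filled with 0
test_encoded = (
    pd.get_dummies(test, columns=['BOROUGH'], dtype=int)
    .reindex(columns=train_encoded.columns, fill_value=0)
)

print(list(test_encoded.columns))
print(test_encoded.values.tolist())
```

A test row with an unseen category ends up as all zeros across that feature's columns, which keeps the column layout identical to what the model was trained on.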


In [56]:
# Add LAND_SQUARE_FEET back to train and test as a numeric column, since it may be a useful feature
X_train['LAND_SQUARE_FEET'] = train['LAND_SQUARE_FEET'].str.replace(',', '').astype(int)
X_test['LAND_SQUARE_FEET'] = test['LAND_SQUARE_FEET'].str.replace(',', '').astype(int)

In [57]:
list(X_test.columns)

['BOROUGH_3',
 'BOROUGH_4',
 'BOROUGH_2',
 'BOROUGH_5',
 'BOROUGH_1',
 'NEIGHBORHOOD_OTHER',
 'NEIGHBORHOOD_FLUSHING-NORTH',
 'NEIGHBORHOOD_EAST NEW YORK',
 'NEIGHBORHOOD_BEDFORD STUYVESANT',
 'NEIGHBORHOOD_FOREST HILLS',
 'NEIGHBORHOOD_BOROUGH PARK',
 'NEIGHBORHOOD_ASTORIA',
 'BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS',
 'TAX_CLASS_AT_PRESENT_1',
 'TAX_CLASS_AT_PRESENT_1D',
 'BLOCK',
 'LOT',
 'EASE-MENT',
 'BUILDING_CLASS_AT_PRESENT_A9',
 'BUILDING_CLASS_AT_PRESENT_A1',
 'BUILDING_CLASS_AT_PRESENT_A5',
 'BUILDING_CLASS_AT_PRESENT_A0',
 'BUILDING_CLASS_AT_PRESENT_A2',
 'BUILDING_CLASS_AT_PRESENT_A3',
 'BUILDING_CLASS_AT_PRESENT_S1',
 'BUILDING_CLASS_AT_PRESENT_A4',
 'BUILDING_CLASS_AT_PRESENT_A6',
 'BUILDING_CLASS_AT_PRESENT_A8',
 'BUILDING_CLASS_AT_PRESENT_B2',
 'BUILDING_CLASS_AT_PRESENT_S0',
 'BUILDING_CLASS_AT_PRESENT_B3',
 'APARTMENT_NUMBER_nan',
 'APARTMENT_NUMBER_RP.',
 'ZIP_CODE',
 'RESIDENTIAL_UNITS',
 'COMMERCIAL_UNITS',
 'TOTAL_UNITS',
 'GROSS_SQUARE_FEET',
 'YEAR_BUIL

In [45]:
X_train.isnull().sum()

BOROUGH_3                                             0
BOROUGH_4                                             0
BOROUGH_2                                             0
BOROUGH_5                                             0
BOROUGH_1                                             0
NEIGHBORHOOD_OTHER                                    0
NEIGHBORHOOD_FLUSHING-NORTH                           0
NEIGHBORHOOD_EAST NEW YORK                            0
NEIGHBORHOOD_BEDFORD STUYVESANT                       0
NEIGHBORHOOD_FOREST HILLS                             0
NEIGHBORHOOD_BOROUGH PARK                             0
NEIGHBORHOOD_ASTORIA                                  0
BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS       0
TAX_CLASS_AT_PRESENT_1                                0
TAX_CLASS_AT_PRESENT_1D                               0
BLOCK                                                 0
LOT                                                   0
EASE-MENT                                       

In [59]:
X_train.drop('EASE-MENT', axis=1, inplace=True)
X_test.drop('EASE-MENT', axis=1, inplace=True)

In [53]:
X_test.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
dtype: object

In [48]:
X_train.shape, y_train.shape

((2507, 51), (2507,))

# Selecting Features with SelectKBest

In [61]:
# Find which value of k gives the lowest test MAE.
# Note: this loop fits LinearRegression (not Ridge) on the selected features.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression

for k in range(1, len(X_train.columns)+ 1):
  print(k, 'features')

  selector = SelectKBest(score_func=f_regression, k=k)
  X_train_selected = selector.fit_transform(X_train, y_train)
  X_test_selected = selector.transform(X_test)

  model = LinearRegression()
  model.fit(X_train_selected, y_train)
  y_pred = model.predict(X_test_selected)
  mae = mean_absolute_error(y_test, y_pred)
  print(mae)

1 features
183640.5858012459
2 features
179554.76843033516
3 features
179291.46658251202
4 features
179291.46658251315
5 features
178896.91028453427
6 features
177128.05627841034
7 features
171377.51979181689
8 features
171464.9236411722
9 features
170925.15918959922
10 features
167467.6534826572
11 features
167181.8079770252
12 features
166712.32028072016
13 features
156488.3995649053
14 features
157546.39911151255
15 features
157543.1574464786
16 features
157544.7025643954
17 features
157405.8454338135
18 features
157405.8454338105
19 features
157282.47005799421
20 features
157282.47005800027
21 features
152830.40163785467
22 features
152861.07851621442
23 features
152861.07851623124
24 features
152861.07851623904
25 features
152861.0785162184
26 features
152551.53848446786
27 features
152414.7649820592
28 features
152337.67446251912
29 features
152288.19845744455
30 features
153293.65932598233
31 features
153448.67631508224
32 features
153448.67631506902
33 features
153437.25548474758
34 features
153418.89382234795
35 features
153413.73094079096
36 features
153394.25749326457
37 features
153204.70235751694
38 features
153204.7023575364
39 features
154578.88620324145
40 features
154681.4501848063
41 features
154737.99503713605
42 features
154731.71616770708
43 features
154731.71616770327
44 features
154720.2085300161
45 features
154956.90266176584
46 features
154917.93533506108
47 features
154913.92897563806
48 features
154914.903435559
49 features
154913.05953998448
50 features
154916.8312791149
51 features
154902.26048136645

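Around 13 to 14 features seems to be the sweet spot: the test MAE falls sharply from about 166,712 at 12 features to about 156,488 at 13, and adding more features lowers it only modestly (the minimum, about 152,288, comes at 29 features).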
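
The loop above scores each `k` against the test set. A common alternative is to choose `k` by cross-validation on the training data alone, keeping the test set for a final check. The sketch below assumes synthetic data from `make_regression` and a fixed `alpha=1.0`; both are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic training data standing in for X_train, y_train
X, y = make_regression(n_samples=300, n_features=12, n_informative=4,
                       noise=15.0, random_state=1)

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_regression)),
    ('ridge', Ridge(alpha=1.0)),
])

# Try every k from 1 to the number of features, scored by 5-fold CV MAE.
# scikit-learn maximizes scores, so MAE is exposed as a negative number.
search = GridSearchCV(
    pipe,
    param_grid={'select__k': list(range(1, 13))},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X, y)

best_k = search.best_params_['select__k']
best_mae = -search.best_score_
print('Best k:', best_k)
print('Cross-validated MAE:', best_mae)
```

Putting `SelectKBest` inside the pipeline means the selection is redone within each fold, so the held-out fold never influences which features are kept.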

In [None]:
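# Apply a SelectKBest selector with 14 features.
# score_func=f_regression matches the loop above; the default, f_classif, is meant for classification targets.
selector = SelectKBest(score_func=f_regression, k=14)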
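
Once a selector is fitted, its `get_support()` mask shows which columns were kept. A minimal sketch on a toy DataFrame follows; the column names and coefficients are made up for illustration, and `k=2` is used only because the toy data has two informative columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)

# Toy features: two that drive the target and two that are pure noise
X = pd.DataFrame({
    'signal_a': rng.normal(size=200),
    'signal_b': rng.normal(size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
y = 5 * X['signal_a'] - 3 * X['signal_b'] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask aligned with the input columns
selected_names = X.columns[selector.get_support()].tolist()
print(selected_names)
```

With the notebook's data, the same mask applied to `X_train.columns` would list the one-hot and numeric columns that the selector retained.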