# Lab 2 Classification <br>
Group Members: Thomas Pengilly, Quynh Chao, Anish Patel, Michael Weatherford <br>
Date: 10/11/2020

# Summary:
The dataset used in this analysis is the South American Real Estate Listings dataset, which contains a number of property features and listing prices. This dataset will be used to perform both a price estimation regression, and a <font color = 'red'> price classification </font>.  

## Data Preparation
<font color = 'red'>Define and prepare your class variables, use proper variable representations (float, int, one-hot). Use pre-processing methods for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. <br>
Describe the final dataset that is used and include description of any newly formed variables created.</font><br>

The classification task is to predict whether a property's (log) price falls in the low, average, or high class for its property type. Classes are defined by log-price tertiles (3-quantiles) computed separately for each property type, so each property type has its own price thresholds and the three classes are balanced within every type. Labeling prices this way lets users judge whether a property is over- or underpriced relative to comparable properties on the market, and act on that information.<br>
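The notebook does not show how `price_class` was built, so the following is only a minimal sketch of per-type tertile labeling under assumed column names (`property_type`, `log_price`) and made-up values. It uses `pd.qcut` within each property type and then maps the tertile codes to the three labels.

```python
import pandas as pd

# Toy listings; column names mirror the notebook, values are made up.
listings = pd.DataFrame({
    'property_type': ['Casa'] * 6 + ['Lote'] * 6,
    'log_price': [10.0, 10.5, 11.0, 11.5, 12.0, 12.5,
                  5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
})

# Tertile code (0, 1, 2) of each listing within its own property type.
tertile = (listings.groupby('property_type')['log_price']
           .transform(lambda s: pd.qcut(s, q=3, labels=False)))

# Map the codes to the class names used in the notebook.
listings['price_class'] = tertile.astype(int).map({0: 'Low', 1: 'Average', 2: 'High'})

# Each property type ends up with the same number of listings per class.
class_counts = listings.groupby(['property_type', 'price_class']).size()
print(class_counts)
```

Because the thresholds are computed per group, a `Lote` can be labeled `High` at a log price that would be `Low` for a `Casa`.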

<font color = 'red'> Dimensionality reduction was explored to reduce a large number of dummy variables to a more manageable size, allowing faster computation times, and hopefully improving model performance by removing redundant and correlated variables. </font> The dataset was scaled and normalized to allow feature importance to be accurately measured by model weights. Several variables were also removed. <br>
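As a rough illustration of how PCA can collapse redundant, correlated columns, the sketch below scales a synthetic matrix and keeps just enough components to explain 99% of the variance. The data are invented for the example (three informative columns plus three near-copies); only the `StandardScaler` and `PCA(n_components = .99, svd_solver = 'full')` settings match the pipelines used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Synthetic stand-in: 3 informative columns plus 3 near-duplicates of them.
base = rng.normal(size=(500, 3))
noisy_copies = base + rng.normal(scale=0.01, size=base.shape)
X_demo = np.hstack([base, noisy_copies])

# Scale first, then keep the fewest components that explain 99% of the variance.
reducer = Pipeline([
    ('Scaler', StandardScaler()),
    ('PCA', PCA(n_components=0.99, svd_solver='full')),
])
X_reduced = reducer.fit_transform(X_demo)

n_kept = reducer.named_steps['PCA'].n_components_
explained = reducer.named_steps['PCA'].explained_variance_ratio_.sum()
print(X_demo.shape[1], 'columns ->', n_kept, 'components')
```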

## Modeling and Evaluation
<font color = 'red'>Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.<br>

INSERT EXPLANATION<br>

Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.<br>

INSERT EXPLANATION<br>

Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.<br>

INSERT EXPLANATION<br>

Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.<br>

INSERT EXPLANATION<br>

Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.<br>

INSERT EXPLANATION<br>

Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.<br>

INSERT EXPLANATION<br>

## Deployment
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?<br>

INSERT EXPLANATION<br>

## Exceptional Work
You have free rein to provide additional modeling.
One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?<br>

INSERT EXPLANATION<br>

</font><br>

# Data Preparation
The class variable is the log price binned into low, average, or high, with the threshold prices defined separately for each property type. The data will be scaled so that feature importance can be read from model weights, and dimensionality reduction with PCA will be explored. In addition, several variables that are not thought to be important will be removed from the dataset. <font color = 'red'> Which variables were removed? </font>

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from pandas import set_option
set_option('display.max_columns',400)
import matplotlib.pyplot as plt
import scipy

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
import sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics as mt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [2]:
# Read in the imputed dataset

# Tom
df = pd.read_csv('C:\\Users\\Tpeng\\OneDrive\\Documents\\SMU\\Term 3\\Machine Learning\\Lab1\\Imputed_Dataset.csv', sep = ',', header = 0)

# Quynh
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Anish
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Michael
#df = pd.read_csv('filepath', sep = ',', header = 0)

# Drop index column
df = df.drop(columns = 'Unnamed: 0')

In [3]:
# Reformat attributes, excluding categoricals, which aren't supported by the dummy variable generation method used.
ordinal_vars = ['rooms', 'bedrooms', 'bathrooms' ]
continuous_vars = ['lat', 'lon', 'surface_total', 'surface_covered', 'price', 'log_price']
string_vars = ['id', 'title', 'description']
time_vars = ['start_date', 'end_date', 'created_on']

# Change data types
df[ordinal_vars] = df[ordinal_vars].astype('uint8')
df[continuous_vars] = df[continuous_vars].astype(np.float64)
df[string_vars] = df[string_vars].astype(str)

# Remove observations missing l3 and price before encoding 
#df2 = df.dropna(axis = 0, subset = ['price', 'l3'])

df2 = df.dropna(axis = 0, subset = ['price'])

Create a transformed dataset with numeric variables square-root transformed to better meet model assumptions of feature distributions. This reduces the number and magnitude of outliers. In addition to this, both datasets will have the property_type, country, province, and department dummified, and all other attributes will be scaled. Both the transformed and non-transformed datasets will be used to create competing models.
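To illustrate why a square-root transform can help, the sketch below compares sample skewness before and after the transform on a handful of made-up surface areas with one large outlier. The values and the use of `Series.skew()` as the yardstick are assumptions for illustration, not results from the listings data.

```python
import numpy as np
import pandas as pd

# Made-up surface areas with one large outlier.
surface_total = pd.Series([1.0, 4.0, 9.0, 16.0, 100.0], name='surface_total')
sqrt_surface_total = np.sqrt(surface_total).rename('sqrt_surface_total')

# Sample skewness before and after the transform.
skew_before = surface_total.skew()
skew_after = sqrt_surface_total.skew()
print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```

The outlier is pulled toward the rest of the values (100 becomes 10), which is the effect the transformed dataset relies on.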

In [4]:
# Create datasets with transformed variables for model selection methods
# Transform rooms, bedrooms, bathrooms, surface_total, and surface_covered using square root
df_transform = df2.copy()
df_transform['sqrt_surface_total'] = df_transform.surface_total.transform(func = 'sqrt')
df_transform['sqrt_surface_covered'] = df_transform.surface_covered.transform(func = 'sqrt')
df_transform['sqrt_bedrooms'] = df_transform.bedrooms.transform(func = 'sqrt')
df_transform['sqrt_bathrooms'] = df_transform.bathrooms.transform(func = 'sqrt')
df_transform['sqrt_rooms'] = df_transform.rooms.transform(func = 'sqrt')

df_transform = df_transform.drop(columns = ['surface_total', 'surface_covered', 'bedrooms', 'bathrooms', 'rooms'])

# Dataset Exploration:
In order to decrease computational costs, we create two datasets to test: a full dataset and one that excludes l3.

## THE CELL BELOW USES L3 (FULL DATASET)

In [5]:
# Get dummy variables for non-transformed dataset
#data = pd.get_dummies(df2, columns = ['l1', 'l2', 'l3', 'property_type'], 
#                      prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
#                      sparse = True, drop_first = False)

# Drop reference levels for each dummified feature and unimportant or currently unusable features. 
#data = data.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
#data = data.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

# Get dummy variables for transformed dataset
trans = pd.get_dummies(df_transform, columns = ['l1', 'l2', 'l3', 'property_type'], 
                          prefix = {'l1':'Country', 'l2':'Province', 'l3': 'Department', 'property_type': 'Property_type'}, 
                          sparse = True, drop_first = False)

# Drop reference levels for each dummified feature (same references as non-transformed data) and unimportant or currently unusable features. 
trans = trans.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Department_Posadas', 'Property_type_Casa'])
trans = trans.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

## THE CELL BELOW EXCLUDES L3

In [6]:
# Get dummy variables for transformed dataset
nol3 = df_transform.drop(columns = 'l3')
trans_nol3 = pd.get_dummies(nol3, columns = ['l1', 'l2', 'property_type'], 
                          prefix = {'l1':'Country', 'l2':'Province', 'property_type': 'Property_type'}, 
                          sparse = True, drop_first = False)

# Drop reference levels for each dummified feature (same references as non-transformed data) and unimportant or currently unusable features. 
trans_nol3 = trans_nol3.drop(columns = ['Country_Argentina', 'Province_Misiones', 'Property_type_Casa'])
trans_nol3 = trans_nol3.drop(columns = ['id', 'start_date', 'end_date', 'created_on', 'lat', 'lon', 'title', 'description', 'price'])

# This cell includes log_price * property type interaction terms.
Add interaction terms for log_price*property_type
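A vectorized way to express the same idea, shown on a three-row toy frame: multiplying each property-type dummy by `log_price` row-wise gives a column that equals `log_price` for listings of that type and 0 otherwise. The toy values are invented, and the `int_` column naming simply follows the convention used in this notebook.

```python
import pandas as pd

# Toy frame with two of the property-type dummy columns.
demo = pd.DataFrame({
    'log_price': [12.0, 11.0, 13.0],
    'Property_type_Departamento': [1, 0, 0],
    'Property_type_Lote': [0, 0, 1],
})
dummy_cols = ['Property_type_Departamento', 'Property_type_Lote']

# Row-wise product of every dummy column with log_price.
interactions = demo[dummy_cols].mul(demo['log_price'], axis=0)
interactions.columns = [c.replace('Property_type_', 'int_') for c in dummy_cols]
print(interactions)
```

The second row belongs to neither type (it stands in for the dropped reference level), so both of its interaction terms are 0.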

In [7]:
props = ['Property_type_Casa de campo', 'Property_type_Departamento', 'Property_type_Depósito', 'Property_type_Finca',
         'Property_type_Garaje', 'Property_type_Local comercial', 'Property_type_Lote', 'Property_type_Oficina',
         'Property_type_Otro', 'Property_type_PH', 'Property_type_Parqueadero']

# Log price - property type interaction terms
ints = ['int_Casa de campo', 'int_Departamento', 'int_Depósito', 'int_Finca',
         'int_Garaje', 'int_Local comercial', 'int_Lote', 'int_Oficina',
         'int_Otro', 'int_PH', 'int_Parqueadero']

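int_df = trans_nol3.copy()
int_vectors = np.empty(shape = (len(int_df), len(props)))

# Manually create interaction terms: column i holds log_price for rows of property type props[i], and 0 otherwise
# (enumerate replaces the manual counter and avoids shadowing the built-in name int)
for i, prop in enumerate(props):
    int_vectors[:, i] = int_df[prop] * int_df['log_price']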

In [23]:
int_df.head(20)

Unnamed: 0,log_price,price_class,sqrt_surface_total,sqrt_surface_covered,sqrt_bedrooms,sqrt_bathrooms,sqrt_rooms,Country_Colombia,Country_Ecuador,Country_Perú,Country_Uruguay,Province_Amazonas,Province_Ancash,Province_Antioquia,Province_Apurimac,Province_Arauca,Province_Arequipa,Province_Atlántico,Province_Ayacucho,Province_Azuay,Province_Bolívar,Province_Boyacá,Province_Bs.As. G.B.A. Zona Norte,Province_Bs.As. G.B.A. Zona Oeste,Province_Bs.As. G.B.A. Zona Sur,Province_Buenos Aires Costa Atlántica,Province_Buenos Aires Interior,Province_Cajamarca,Province_Caldas,Province_Callao,Province_Canelones,Province_Capital Federal,Province_Caquetá,Province_Casanare,Province_Catamarca,Province_Cauca,Province_Cesar,Province_Chaco,Province_Chocó,Province_Chubut,Province_Colonia,Province_Corrientes,Province_Cundinamarca,Province_Cusco,Province_Córdoba,Province_El Oro,Province_Entre Ríos,Province_Formosa,Province_Guayas,Province_Huancavelica,Province_Huila,Province_Huánuco,Province_Ica,Province_Imbabura,Province_Jujuy,Province_Junín,Province_La Guajira,Province_La Libertad,Province_La Pampa,Province_La Rioja,Province_Lambayeque,Province_Lima,Province_Loreto,Province_Madre de Dios,Province_Magdalena,Province_Maldonado,Province_Manabi,Province_Mendoza,Province_Meta,Province_Montevideo,Province_Moquegua,Province_Morona Santiago,Province_Nariño,Province_Neuquén,Province_Norte de Santander,Province_Pasco,Province_Pastaza,Province_Pichincha,Province_Piura,Province_Puno,Province_Putumayo,Province_Quindío,Province_Risaralda,Province_Rocha,Province_Río Negro,Province_Salta,Province_San Andrés Providencia y Santa Catalina,Province_San Juan,Province_San Luis,Province_San Martin,Province_Santa Cruz,Province_Santa Elena,Province_Santa Fe,Province_Santander,Province_Santiago Del Estero,Province_Santo Domingo De Los Tsáchilas,Province_Sucre,Province_Tacna,Province_Tierra Del Fuego,Province_Tolima,Province_Tucumán,Province_Tumbes,Province_Tungurahua,Province_Ucayali,Province_Valle del Cauca,Province_Vichada,Property_type_Casa de campo,Property_type_Departamento,Property_type_Depósito,Property_type_Finca,Property_type_Garaje,Property_type_Local comercial,Property_type_Lote,Property_type_Oficina,Property_type_Otro,Property_type_PH,Property_type_Parqueadero
0,12.860999,High,14.071247,12.247449,1.732422,1.414062,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,12.860999,High,14.071247,12.247449,1.732422,1.414062,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,12.180755,Average,13.152946,13.152946,1.732422,1.414062,2.646484,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,11.350407,Low,7.0,6.324555,1.732422,1.0,1.732422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,11.350407,Low,7.0,6.324555,1.732422,1.0,1.732422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,13.253392,High,20.0,20.0,1.732422,1.732422,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,13.748302,High,13.892444,13.228757,1.414062,1.732422,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
7,11.396392,Low,5.567764,5.567764,1.414062,1.0,1.414062,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
8,11.797352,Average,5.656854,5.656854,1.414062,1.0,1.414062,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
9,12.345835,High,9.380832,9.380832,1.414062,1.0,1.732422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [44]:
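# display and HTML are part of the public IPython.display module (IPython.core.display is deprecated for these)
from IPython.display import display, HTML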
np.set_printoptions(edgeitems=20, linewidth=1000000, formatter=dict(float=lambda x: "%.3g" % x))

display(HTML("<style>.container { width:98% !important; }</style>"))
display(HTML("<style>.output_result { width:98% !important; }</style>"))

int_vectors[0:20,:]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 13.7, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.8, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 12.3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.6, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.7, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.7, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.8, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 11.1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 12.5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 11.9, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 11.9, 0, 0, 0, 0, 0],
       [0, 11.6, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [8]:
# Create new (reduced) dataset with interaction terms
X_pre = trans_nol3.drop(columns = 'price_class').values
X = np.append(X_pre, int_vectors, axis = 1)
y = trans_nol3.price_class.values

<font color = 'red'> ADD df.info() to show data types, number of elements, etc.</font>

# Data Preparation Discussion:
The final dataset used will be standardized and projected onto its principal components with PCA. The dataset used is <font color = 'red'> WHICH ONE? </font><br><br>We are using stratified 5-fold cross validation to keep the classes balanced in every fold while significantly reducing computation time compared to 10-fold cross validation.

In [8]:
# Import libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics as mt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [9]:
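# Create a cross validation object
# random_state is omitted: it only has an effect when shuffle = True, and recent scikit-learn versions raise an error otherwise
cv_obj = StratifiedKFold(n_splits = 5)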

# Modeling and Evaluation

## Evaluation Metrics:
Accuracy is the primary metric used to select a model. Because the three price classes are tertiles, they are balanced within each property type, so accuracy is not distorted by class imbalance and directly answers how often customers can trust the model's prediction.

Precision is also worth considering. For a given class, precision is the fraction of listings the model assigned to that class that truly belong to it, so it tells customers how far they can trust a specific predicted class. A high precision indicates that the model's predicted classification is likely correct.
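The small example below, on six hand-made labels, shows how accuracy and per-class precision and recall relate. It uses the same `sklearn.metrics` calls as the modeling cells; the labels themselves are invented for illustration.

```python
from sklearn import metrics as mt

labels = ['Low', 'Average', 'High']
y_true = ['Low', 'Low', 'Average', 'Average', 'High', 'High']
y_pred = ['Low', 'Average', 'Average', 'Average', 'High', 'Low']

# Overall share of correct predictions (4 of 6 here).
accuracy = mt.accuracy_score(y_true, y_pred)

# Per-class scores, returned in the order given by `labels`.
precision, recall, f_score, support = mt.precision_recall_fscore_support(
    y_true, y_pred, labels=labels)

for name, p, r in zip(labels, precision, recall):
    print(f'{name}: precision = {p:.2f}, recall = {r:.2f}')
```

`High` is predicted only once and that prediction is right, so its precision is 1.0 even though half of the true `High` listings were missed (recall 0.5).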

## Data splits
The data will be split using 5-fold stratified cross validation. The data splits should be stratified to ensure that each price classification is equally represented in every cross validation split. <font color = 'red'>5 folds</font> are used as this number is thought to provide sufficient generalization while reducing computation time.
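A quick check of what stratification does, on a toy label vector of 10 listings per class: every test fold produced by `StratifiedKFold` contains the same number of `Low`, `Average`, and `High` labels. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 listings per class; features are irrelevant for the split itself.
y_demo = np.array(['Low'] * 10 + ['Average'] * 10 + ['High'] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

cv = StratifiedKFold(n_splits=5)
fold_counts = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    classes, counts = np.unique(y_demo[test_idx], return_counts=True)
    fold_counts.append({str(c): int(n) for c, n in zip(classes, counts)})

print(fold_counts)
```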

## 1. KNN Classification Modeling
The following KNN classification modeling is performed on a reduced dataset to reduce computation times. After optimal parameters are found, these will be compared to classification models trained on the full dataset.

In [15]:
# Create a list of k values to test
k = [1,5,11,21, 31, 41, 51, 61]

# Train CV KNN Classification models
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    for i in k:
        clf_pipe = Pipeline([('Scaler', StandardScaler()),
                             ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                             ('clf_knn', KNeighborsClassifier(n_neighbors = i, n_jobs = -1))])
        
        %time clf_pipe.fit(X[train],y[train])
        yhat = clf_pipe.predict(X[test])
        total_acc = mt.accuracy_score(y[test],yhat)
        precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test], yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
        
        print('KNN Accuracy: k = ', i, '  ', total_acc)
        print('Precision: ', precision)
        print('Recall: ', recall)
        print('F Score: ', f_score, '\n')
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######

Wall time: 8.84 s
KNN Accuracy: k =  1    0.5084963232915706
Precision:  [0.57233159 0.41144673 0.5569844 ]
Recall:  [0.49792944 0.44965593 0.57563065]
F Score:  [0.53254438 0.42970362 0.56615404] 

Wall time: 8.7 s
KNN Accuracy: k =  5    0.5585254350591939
Precision:  [0.68188709 0.44269878 0.61419543]
Recall:  [0.49176743 0.55608933 0.62435046]
F Score:  [0.57142857 0.49295754 0.61923131] 

Wall time: 8.79 s
KNN Accuracy: k =  11    0.5885105557112975
Precision:  [0.69383958 0.471319   0.64614713]
Recall:  [0.54364751 0.56647624 0.65253677]
F Score:  [0.60962924 0.51453506 0.64932623] 

Wall time: 8.78 s
KNN Accuracy: k =  21    0.6022793435835507
Precision:  [0.69633444 0.48477687 0.66420977]
Recall:  [0.5645188  0.57264347 0.66692911]
F Score:  [0.6235363  0.52505952 0.66556666] 

Wall time: 8.83 s
KNN Accuracy: k =  31    0.60705582990102
Precision:  [0.70272265 0.48963809 0.66742654]
Recall:  [0.56776545 0.57747988 0.67310175]
F Score:  [0.62807615

The cross validation suggests higher k-values will result in better performing models for all metrics considered. Another cross-validation will be run using a new k-window centered on the estimated optimum k value of 51.
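As an alternative to the hand-written loops, the same kind of search over k could be expressed with `GridSearchCV`, which this notebook imports but does not use. The sketch below runs on small synthetic data from `make_classification`, so the data, the grid values, and the resulting best k are illustrative only and say nothing about the listings dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the listings features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=6)

knn_pipe = Pipeline([
    ('Scaler', StandardScaler()),
    ('PCA', PCA(n_components=0.99, svd_solver='full')),
    ('clf_knn', KNeighborsClassifier()),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {'clf_knn__n_neighbors': [1, 5, 11, 21, 31]}
search = GridSearchCV(knn_pipe, param_grid, scoring='accuracy',
                      cv=StratifiedKFold(n_splits=5))
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_['mean_test_score']` then holds the accuracy averaged over folds for each k, which avoids reading per-split printouts by eye.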

In [13]:
# Create a list of k values to test
k = [41, 47, 51, 57, 61, 67, 71]

# Train CV KNN models
#X = trans_nol3.drop(columns = ['price_class']).values
#y = trans_nol3.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    for i in k:
        clf_pipe = Pipeline([('Scaler', StandardScaler()),
                             ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                             ('clf_knn', KNeighborsClassifier(n_neighbors = i, n_jobs = -1))])
        
        %time clf_pipe.fit(X[train],y[train])
        yhat = clf_pipe.predict(X[test])
        total_acc = mt.accuracy_score(y[test],yhat)
        precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
        
        print('KNN Accuracy: k = ', i, '  ', total_acc)
        print('Precision: ', precision)
        print('Recall: ', recall)
        print('F Score: ', f_score, '\n')
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######
Wall time: 8.79 s
KNN Accuracy: k =  41    0.6092877320854808
Precision:  [0.70551594 0.49078018 0.67075575]
Recall:  [0.57034951 0.57968709 0.67502283]
F Score:  [0.63077289 0.53154159 0.67288253]
Wall time: 8.75 s
KNN Accuracy: k =  47    0.6096004140340284
Precision:  [0.70542191 0.4912986  0.6707359 ]
Recall:  [0.57025012 0.57913529 0.67656599]
F Score:  [0.63067453 0.53161313 0.67363833]
Wall time: 8.74 s
KNN Accuracy: k =  51    0.6111314773682962
Precision:  [0.70643738 0.49333481 0.67153695]
Recall:  [0.57296671 0.58020644 0.67741631]
F Score:  [0.63274004 0.53325577 0.67446382]
Wall time: 8.92 s
KNN Accuracy: k =  57    0.6117460591292346
Precision:  [0.70849437 0.49335895 0.67305528]
Recall:  [0.57087958 0.58354973 0.67795169]
F Score:  [0.63228576 0.53467761 0.67549461]
Wall time: 8.62 s
KNN Accuracy: k =  61    0.6115843270868825
Precision:  [0.70811943 0.49257419 0.67366281]
Recall:  [0.57121087 0.58241366 0.67826662]
F Score:  [0.63233946 0.5

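The optimum k value in most situations is 57 for the interaction term dataset excluding l3. The next cell fixes k at 57 and evaluates it on every cross validation split. Trying a different distance metric, to account for our high-dimensional, mostly boolean-valued data, remains a follow-up step: the pipeline below still uses the default Euclidean distance.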
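A minimal sketch of how distance metrics could be compared, assuming the `metric` parameter of `KNeighborsClassifier` is the knob being varied. It runs on synthetic data, and it compares Euclidean with Manhattan distance rather than a boolean-specific metric, since scaled (and PCA-transformed) features are continuous. The scores it prints are not indicative of the listings data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the listings features.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=6)

scores = {}
for metric in ['euclidean', 'manhattan']:
    pipe = Pipeline([
        ('Scaler', StandardScaler()),
        ('clf_knn', KNeighborsClassifier(n_neighbors=11, metric=metric)),
    ])
    # Mean accuracy over 5 stratified folds for this metric.
    fold_acc = cross_val_score(pipe, X_demo, y_demo, scoring='accuracy',
                               cv=StratifiedKFold(n_splits=5))
    scores[metric] = fold_acc.mean()

print(scores)
```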

In [14]:
# Create a list of k values to test
k = [57]

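# Create a cross validation object
# random_state is omitted: it only has an effect when shuffle = True, and recent scikit-learn versions raise an error otherwise
cv_obj = StratifiedKFold(n_splits = 5)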

# Use the new dataset with interaction terms
X_pre = trans_nol3.drop(columns = 'price_class').values
X = np.append(X_pre, int_vectors, axis = 1)
y = trans_nol3.price_class.values

it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    for i in k:
        clf_pipe = Pipeline([('Scaler', StandardScaler()),
                             ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                             ('clf_knn', KNeighborsClassifier(n_neighbors = i, n_jobs = -1))])
        
        %time clf_pipe.fit(X[train],y[train])
        yhat = clf_pipe.predict(X[test])
        total_acc = mt.accuracy_score(y[test],yhat)
        precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
        
        print('KNN Accuracy: k = ', i, '  ', total_acc)
        print('Precision: ', precision)
        print('Recall: ', recall)
        print('F Score: ', f_score, '\n')
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######

Wall time: 9.03 s
KNN Accuracy: k =  57    0.6125115907963685
Precision:  [0.70813554 0.49430518 0.67334394]
Recall:  [0.57326487 0.58179694 0.67962082]
F Score:  [0.63360246 0.53449433 0.67646782] 

#############################

######## CV SPLIT  1  #######

Wall time: 9.12 s
KNN Accuracy: k =  57    0.5760418351393606
Precision:  [0.66615655 0.46010518 0.63190243]
Recall:  [0.5768428  0.52820696 0.62169312]
F Score:  [0.61829093 0.49180972 0.6267562 ] 

#############################

######## CV SPLIT  2  #######

Wall time: 9.02 s
KNN Accuracy: k =  57    0.5900371987708233
Precision:  [0.67345181 0.47626027 0.64644345]
Recall:  [0.56995196 0.53787977 0.65973797]
F Score:  [0.61739427 0.50519801 0.65302305] 

#############################

######## CV SPLIT  3  #######

Wall time: 9.12 s
KNN Accuracy: k =  57    0.5623005261795911
Precision:  [0.59505669 0.46908418 0.61578229]
Recall:  [0.64686589 0.44423526 0.59646636]
F Score:  [0.61988063 0.456321

## KNN Classification with full dataset (including L3)
The above models excluded l3 Departments for the sake of computation time. We will now run the models using a full dataset including interaction terms between log price and property type. We will then compare the models' performances.
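One way to make this comparison concrete is a paired t-test on per-fold accuracies. The sketch below uses only the three splits (0 to 2) for which both k = 57 runs print an accuracy in this notebook's output, and the choice of `scipy.stats.ttest_rel` is an assumption about method rather than something the notebook itself does. With so few folds, and with training sets that overlap across folds, the p-value should be read as a rough indication only.

```python
import numpy as np
from scipy import stats

# Accuracies for k = 57 copied from the printed outputs, splits 0 to 2.
acc_no_l3 = np.array([0.6125115907963685, 0.5760418351393606, 0.5900371987708233])
acc_full = np.array([0.6307010544929161, 0.6121192517116826, 0.6202706345355545])

# Paired test: both models were scored on the same cross validation splits.
mean_diff = (acc_full - acc_no_l3).mean()
t_stat, p_value = stats.ttest_rel(acc_full, acc_no_l3)

print(f'mean accuracy gain: {mean_diff:.4f}, t = {t_stat:.2f}, p = {p_value:.3f}')
```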

In [15]:
# Create a list of k values to test
k = [21, 31, 41, 51, 61]

# Train CV KNN models
X = trans.drop(columns = ['price_class']).values
X = np.append(X, int_vectors, axis = 1)
y = trans.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    for i in k:
        clf_pipe = Pipeline([('Scaler', StandardScaler()),
                             ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                             ('clf_knn', KNeighborsClassifier(n_neighbors = i, n_jobs = -1))])
        
        %time clf_pipe.fit(X[train],y[train])
        yhat = clf_pipe.predict(X[test])
        total_acc = mt.accuracy_score(y[test],yhat)
        precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
        
        print('KNN Accuracy: k = ', i, '  ', total_acc)
        print('Precision: ', precision)
        print('Recall: ', recall)
        print('F Score: ', f_score, '\n')
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######

Wall time: 2min 18s
KNN Accuracy: k =  21    0.6272292066504216
Precision:  [0.71108353 0.50247135 0.70385429]
Recall:  [0.60745403 0.59065827 0.68151041]
F Score:  [0.65519644 0.54300762 0.69250216] 

Wall time: 2min 18s
KNN Accuracy: k =  31    0.6299570870980958
Precision:  [0.71136512 0.50604654 0.70599922]
Recall:  [0.61441113 0.59085303 0.68267565]
F Score:  [0.659343   0.54517139 0.69414157] 

Wall time: 2min 21s
KNN Accuracy: k =  41    0.6314018933431091
Precision:  [0.71104935 0.50772195 0.70757946]
Recall:  [0.61868478 0.59010647 0.68355746]
F Score:  [0.6616592  0.54582301 0.69536106] 

Wall time: 2min 18s
KNN Accuracy: k =  51    0.631585189657775
Precision:  [0.71161305 0.50792632 0.70755765]
Recall:  [0.61997681 0.59072319 0.68226624]
F Score:  [0.66264186 0.54620487 0.69468182] 

Wall time: 2min 17s
KNN Accuracy: k =  61    0.6301296012766049
Precision:  [0.70858465 0.50646702 0.70723813]
Recall:  [0.61881729 0.58848351 0.68128996]
F Score

KeyboardInterrupt: 

NotFittedError: This PCA instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In [11]:
# Create a list of k values to test
k = [55, 57, 59]

# Train CV KNN models
X = trans.drop(columns = ['price_class']).values
X = np.append(X, int_vectors, axis = 1)
y = trans.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    for i in k:
        clf_pipe = Pipeline([('Scaler', StandardScaler()),
                             ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                             ('clf_knn', KNeighborsClassifier(n_neighbors = i, n_jobs = -1))])
        
        %time clf_pipe.fit(X[train],y[train])
        yhat = clf_pipe.predict(X[test])
        total_acc = mt.accuracy_score(y[test],yhat)
        precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
        
        print('KNN Accuracy: k = ', i, '  ', total_acc)
        print('Precision: ', precision)
        print('Recall: ', recall)
        print('F Score: ', f_score, '\n')
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######

Wall time: 2min 21s
KNN Accuracy: k =  55    0.6313372005261683
Precision:  [0.71107483 0.50760218 0.70774164]
Recall:  [0.62047375 0.59059335 0.68119548]
F Score:  [0.66269195 0.54596192 0.69421488] 

Wall time: 2min 22s
KNN Accuracy: k =  57    0.6307010544929161
Precision:  [0.71010305 0.50693903 0.70768576]
Recall:  [0.61868478 0.59046352 0.68116398]
F Score:  [0.6612492  0.5455227  0.69417164] 

Wall time: 2min 23s
KNN Accuracy: k =  59    0.6306687080844456
Precision:  [0.70917031 0.50708587 0.70796663]
Recall:  [0.61871791 0.59000909 0.68147892]
F Score:  [0.66086341 0.54541363 0.6944703 ] 

#############################

######## CV SPLIT  1  #######

Wall time: 2min 24s
KNN Accuracy: k =  55    0.6114938810717558
Precision:  [0.69313305 0.48844084 0.68537745]
Recall:  [0.62063939 0.56852116 0.64449483]
F Score:  [0.65488613 0.52544737 0.66430774] 

Wall time: 2min 23s
KNN Accuracy: k =  57    0.6121192517116826
Precision:  [0.69360021 0.48921927 

In [19]:
# Final KNN Model trained on full dataset

# Train CV KNN models
X = trans.drop(columns = ['price_class']).values
X = np.append(X, int_vectors, axis = 1)
y = trans.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######\n')
    clf_pipe = Pipeline([('Scaler', StandardScaler()),
                        ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                        ('clf_knn', KNeighborsClassifier(n_neighbors = 57, n_jobs = -1))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                            labels = ['Low', 'Average', 'High'])
        
    print('KNN Accuracy: k = 57 ', total_acc)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F Score: ', f_score, '\n')
        
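    # No labels argument is passed here, so rows/columns follow sorted label order
    # (Average, High, Low), not the ['Low', 'Average', 'High'] order of the arrays above
    print("confusion matrix\n",mt.confusion_matrix(y[test],yhat), '\n')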
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######

Wall time: 2min 21s
KNN Accuracy: k = 57  0.6307010544929161
Precision:  [0.71010305 0.50693903 0.70768576]
Recall:  [0.61868478 0.59046352 0.68116398]
F Score:  [0.6612492  0.5455227  0.69417164] 

confusion matrix
 [[18191  6796  5821]
 [ 8321 21629  1803]
 [ 9372  2138 18675]] 

#############################

######## CV SPLIT  1  #######

Wall time: 2min 24s
KNN Accuracy: k = 57  0.6121192517116826
Precision:  [0.69360021 0.48921927 0.68589636]
Recall:  [0.62222958 0.56930018 0.64405392]
F Score:  [0.65597932 0.52623052 0.66431692] 

confusion matrix
 [[17539  7141  6128]
 [ 9133 20450  2169]
 [ 9179  2224 18782]] 

#############################

######## CV SPLIT  2  #######

Wall time: 2min 21s
KNN Accuracy: k = 57  0.6202706345355545
Precision:  [0.69465045 0.50019833 0.69350557]
Recall:  [0.611297   0.57303298 0.67463467]
F Score:  [0.65031367 0.5341442  0.68393997] 

confusion matrix
 [[17654  7019  6135]
 [ 8355 21421  1976]
 [ 9285  2448 18452]] 

## KNN Discussion:
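
One detail to keep in mind when reading the KNN output above: `mt.confusion_matrix` is called without a `labels` argument, so scikit-learn orders its rows and columns by sorted label (`Average`, `High`, `Low`), whereas the precision, recall, and F score arrays use the order `['Low', 'Average', 'High']`. The sketch below is not part of the original analysis. It recomputes per-class precision and recall from the split 0 confusion matrix in plain Python, assuming the sorted label order applies. Under that assumption the values agree with the printed metrics (for example, recall for `Low` is about 0.6187).

```python
# Confusion matrix printed for CV split 0 (rows = true class, columns = predicted).
# Assumption: rows/columns follow sorted label order because no `labels` argument was passed.
sorted_labels = ['Average', 'High', 'Low']
cm = [[18191, 6796, 5821],
      [8321, 21629, 1803],
      [9372, 2138, 18675]]

def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    n = len(labels)
    out = {}
    for i, label in enumerate(labels):
        true_pos = cm[i][i]
        row_total = sum(cm[i])                       # all samples truly in class i
        col_total = sum(cm[r][i] for r in range(n))  # all samples predicted as class i
        out[label] = (true_pos / col_total, true_pos / row_total)
    return out

metrics = per_class_metrics(cm, sorted_labels)
accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))

# Report in the same order as the metric arrays in the notebook output.
for label in ['Low', 'Average', 'High']:
    precision, recall = metrics[label]
    print(f'{label:8s} precision = {precision:.4f}  recall = {recall:.4f}')
print(f'accuracy = {accuracy:.4f}')
```

Read this way, the `Average` class has the lowest precision and recall of the three in this split, and most of its errors fall in the `Low` column.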


## 2. Random Forest Classification Modeling

In [11]:
# Create a scaling, PCA, and Random Forest pipeline
# Note: the CV loop in the next cell rebuilds this pipeline with n_estimators = 250
from sklearn.ensemble import RandomForestClassifier

clf_pipe = Pipeline([('Scaler', StandardScaler()),
                     ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                     ('CLF', RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_depth = None, 
                                                  min_samples_split = 4, min_samples_leaf = 1, n_jobs = -1, random_state = 6, 
                                                  verbose = 1))])

In [12]:
# Train CV Random Forest models using the reduced dataset (no l3)
X = trans_nol3.drop(columns = ['price_class']).values
X = np.append(X, int_vectors, axis = 1)
y = trans_nol3.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    clf_pipe = Pipeline([('Scaler', StandardScaler()),
                     ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                     ('CLF', RandomForestClassifier(n_estimators = 250, criterion = 'gini', max_depth = None, 
                                                 min_samples_split = 4, min_samples_leaf = 1, n_jobs = -1, random_state = 6, 
                                                 verbose = 1))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    precision, recall, f_score, support = mt.precision_recall_fscore_support(y[test],yhat, 
                                                                                 labels = ['Low', 'Average', 'High'])
    print('Random Forest Accuracy: ', total_acc)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F Score: ', f_score, '\n')
    it += 1
    print('############################')

######## CV SPLIT  0  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   44.6s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.2min finished


Wall time: 1min 16s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.2s finished


Random Forest Accuracy:  0.566116058913592
Precision:  [0.64571052 0.45767409 0.61363432]
Recall:  [0.55156535 0.4985718  0.64548232]
F Score:  [0.59493648 0.47724836 0.62915554] 

############################
######## CV SPLIT  1  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   44.3s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.2min finished


Wall time: 1min 16s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.3s finished


Random Forest Accuracy:  0.5479432853523101
Precision:  [0.62177345 0.43867483 0.59770754]
Recall:  [0.56100712 0.47192288 0.60928445]
F Score:  [0.58982933 0.45469187 0.60344048] 

############################
######## CV SPLIT  2  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   44.7s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.2min finished


Wall time: 1min 18s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.2s finished


Random Forest Accuracy:  0.5553075637500674
Precision:  [0.6220112  0.44919102 0.60664597]
Recall:  [0.54467451 0.47851207 0.63992819]
F Score:  [0.58077962 0.46338818 0.62284278] 

############################
######## CV SPLIT  3  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   44.9s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.2min finished


Wall time: 1min 18s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.0s finished


Random Forest Accuracy:  0.5271068748382645
Precision:  [0.5599876  0.43365253 0.58062038]
Recall:  [0.59859528 0.41158141 0.57123961]
F Score:  [0.57864818 0.4223288  0.57589179] 

############################
######## CV SPLIT  4  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   37.2s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.0min finished


Wall time: 1min 5s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    0.6s finished


Random Forest Accuracy:  0.45434156755765936
Precision:  [0.44340892 0.38202767 0.54037076]
Recall:  [0.7424132  0.24108157 0.38740867]
F Score:  [0.55521419 0.29561376 0.45128036] 

############################
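
The five Random Forest folds above vary noticeably, with the last split scoring well below the others. As a small illustration (not part of the original notebook), the sketch below summarizes the printed accuracies by their mean and sample standard deviation using only the standard library. The values are copied from the output above and rounded to six decimals.

```python
from statistics import mean, stdev

# Accuracies copied from the Random Forest CV output above (splits 0-4), rounded.
rf_fold_accuracy = [0.566116, 0.547943, 0.555308, 0.527107, 0.454342]

rf_mean = mean(rf_fold_accuracy)
rf_std = stdev(rf_fold_accuracy)   # sample standard deviation (n - 1 denominator)
worst_split = min(range(len(rf_fold_accuracy)), key=rf_fold_accuracy.__getitem__)

print(f'mean accuracy = {rf_mean:.4f}, std = {rf_std:.4f}')
print(f'lowest-scoring split: {worst_split}')
```

Reporting the spread alongside the mean makes it easier to compare this model with the KNN results, where the shown splits differ by only about two percentage points.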


In [14]:
# Build X and y from the full dataset for the grid search
X = trans.drop(columns = ['price_class']).values
X = np.append(X, int_vectors, axis = 1)
y = trans.price_class.values
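
The grid search in the next cell prints the entire `grid.cv_results_` dictionary, which is hard to read. The sketch below shows one way to tabulate it instead, listing the mean and standard deviation of each candidate's test score by rank. It is a toy illustration, not the notebook's run: the synthetic data from `make_classification`, the small parameter grid, the `demo_` names, and the 3-fold setting are all assumptions chosen so that it finishes in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real feature matrix (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=6,
                                     n_classes=3, random_state=6)

demo_pipe = Pipeline([('Scaler', StandardScaler()),
                      ('PCA', PCA(n_components=.99, svd_solver='full')),
                      ('CLF', RandomForestClassifier(random_state=6))])

# Deliberately tiny grid so the example runs quickly.
demo_params = dict(CLF__n_estimators=[10, 25],
                   CLF__min_samples_split=[2, 6, 12])

demo_grid = GridSearchCV(demo_pipe, param_grid=demo_params, cv=3, scoring='accuracy')
demo_grid.fit(X_demo, y_demo)

# Collect one row per candidate and sort by rank (1 = best mean test score).
results = demo_grid.cv_results_
rows = sorted(zip(results['rank_test_score'], results['mean_test_score'],
                  results['std_test_score'], results['params']),
              key=lambda row: row[0])

for rank, mean_score, std_score, params in rows:
    print(f'rank {rank}: accuracy = {mean_score:.3f} +/- {std_score:.3f}  {params}')
print('best parameters:', demo_grid.best_params_)
```

The same loop should work unchanged on the fitted `grid` object from the real search, since `rank_test_score`, `mean_test_score`, `std_test_score`, and `params` are standard keys of `cv_results_`.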

In [16]:
# Use a GridSearchCV Object for random forest classification
clf_pipe = Pipeline([('Scaler', StandardScaler()),
                     ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                     ('CLF', RandomForestClassifier(max_depth = None, min_samples_leaf = 1, n_jobs = -1, random_state = 6, verbose = 1))])

params = dict(CLF__n_estimators = [250, 1000],
             CLF__criterion = ['gini'],
             CLF__min_samples_split = [2, 6, 12])

grid = GridSearchCV(clf_pipe, param_grid = params, cv = 5, n_jobs = -1, verbose = 2, scoring = 'accuracy')
grid.fit(X, y)

print(grid.best_score_, '\n')
print(grid.cv_results_, '\n')
print(grid.best_estimator_, '\n')

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.7min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.6s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250, total= 5.4min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.8min remaining:    0.0s
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.6min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.4s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250, total= 5.3min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.7min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.5s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250, total= 5.3min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.8min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.1s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.8s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250, total= 5.4min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.7min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=250, total= 5.3min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 14.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.8s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    7.0s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    9.4s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   13.1s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   24.6s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000, total=15.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 14.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    9.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   13.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   25.3s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000, total=15.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 14.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.8s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    9.1s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.7s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.8s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   24.8s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000, total=15.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 14.5min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   13.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   25.7s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000, total=16.2min
[CV] CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 14.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    2.9s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    5.5s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    7.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   13.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   26.1s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=2, CLF__n_estimators=1000, total=15.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.2s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.8s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250, total= 5.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.0s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.0s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250, total= 5.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.1s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.0s finished



[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250, total= 5.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.5min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.0s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.4s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250, total= 5.2min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.4s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    8.2s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=250, total= 5.1min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.5min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.6s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000, total=15.2min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.5s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.2s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000, total=15.1min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.6s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.8s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.6s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000, total=15.1min
[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.3s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000, total=15.0min


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    2.8s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    5.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    7.5s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   14.1s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   25.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV] CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000 
[CV]  CLF__criterion=gini, CLF__min_samples_split=6, CLF__n_estimators=1000, total=14.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.4s finished



[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250, total= 5.1min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.0s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.5s finished



[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250, total= 5.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.3s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250, total= 4.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.4min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.8s finished



[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250, total= 5.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.5s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.9s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=250, total= 4.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.3min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.5s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.6s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.6s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   22.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000, total=15.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.6s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.8s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.0s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   22.8s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000, total=14.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.2s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.6s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.2s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:

[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000, total=14.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    6.2s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    8.1s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.7s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.8s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000, total=15.0min
[CV] CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000 

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 13.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    2.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    5.0s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    6.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    5.0s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:   12.6s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:   23.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:


[CV]  CLF__criterion=gini, CLF__min_samples_split=12, CLF__n_estimators=1000, total=14.8min


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 325.4min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 17.1min finished


0.5658399518678177 

{'mean_fit_time': array([312.9516582 , 943.64186339, 297.18233438, 890.43272991,
       291.12389221, 880.24048772]), 'std_fit_time': array([3.02418661, 7.8142871 , 5.59126542, 5.85530264, 4.19870856,
       5.43745507]), 'mean_score_time': array([ 6.52651863, 13.37808328,  6.83038378, 13.29363241,  6.51859365,
       12.90916367]), 'std_score_time': array([0.26831916, 0.76398858, 0.38734134, 0.58828994, 0.27695194,
       0.89927646]), 'param_CLF__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'gini', 'gini'],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_CLF__min_samples_split': masked_array(data=[2, 2, 6, 6, 12, 12],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_CLF__n_estimators': masked_array(data=[250, 1000, 250, 1000, 250, 1000],
             mask=[False, False, False, False, False, False],
       f

In [23]:
print('The optimum parameters were: ', grid.best_params_, '\n')
print('The best estimator was: ', grid.best_estimator_, '\n')
print('With an accuracy of: ', grid.best_score_)


The optimum parameters were:  {'CLF__criterion': 'gini', 'CLF__min_samples_split': 12, 'CLF__n_estimators': 1000} 

The best estimator was:  Pipeline(memory=None,
     steps=[('Scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=6,
  svd_solver='full', tol=0.0, whiten=False)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            ...imators=1000, n_jobs=-1,
            oob_score=False, random_state=6, verbose=1, warm_start=False))]) 

With an accuracy of:  0.5658399518678177


In [26]:
cvres = pd.DataFrame(grid.cv_results_)
cvres
# Export to CSV
#cvres.to_csv('RandomForestClassification_Results 1.csv', index = False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_CLF__criterion,param_CLF__min_samples_split,param_CLF__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,312.951658,3.024187,6.526519,0.268319,gini,2,250,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.590182,0.579126,0.585476,0.547033,0.454385,0.551241,0.050724,6,0.970165,0.970208,0.970799,0.962389,0.95982,0.966676,0.004627
1,943.641863,7.814287,13.378083,0.763989,gini,2,1000,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.591249,0.579988,0.58567,0.547593,0.453695,0.55164,0.051264,5,0.970165,0.970208,0.970799,0.962389,0.95982,0.966676,0.004627
2,297.182334,5.591265,6.830384,0.387341,gini,6,250,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.598107,0.585724,0.591698,0.555939,0.458385,0.557971,0.051846,4,0.955221,0.954226,0.955164,0.947865,0.945353,0.951566,0.00414
3,890.43273,5.855303,13.293632,0.58829,gini,6,1000,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.598948,0.586393,0.592657,0.555497,0.457404,0.55818,0.052554,3,0.956461,0.955318,0.956547,0.949156,0.946674,0.952831,0.004113
4,291.123892,4.198709,6.518594,0.276952,gini,12,250,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.607821,0.595471,0.600259,0.563206,0.458924,0.565137,0.055246,2,0.901945,0.900663,0.901536,0.896452,0.89593,0.899305,0.002582
5,880.240488,5.437455,12.909164,0.899276,gini,12,1000,"{'CLF__criterion': 'gini', 'CLF__min_samples_s...",0.608576,0.595612,0.600226,0.563142,0.461641,0.56584,0.054332,1,0.903549,0.902361,0.903342,0.897948,0.897148,0.90087,0.002753


In [20]:
grid.best_estimator_
grid.best_score_

Pipeline(memory=None,
     steps=[('Scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=6,
  svd_solver='full', tol=0.0, whiten=False)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            ...imators=1000, n_jobs=-1,
            oob_score=False, random_state=6, verbose=1, warm_start=False))])

In [21]:
grid.best_score_

0.5658399518678177

The best model used 1000 estimators and a min_samples_split of 12, both the upper limits of our grid search, which suggests testing even higher values. The 1000-estimator models provided a negligible increase in accuracy over the 250-estimator models (0.5658 vs. 0.5651 mean test accuracy at min_samples_split = 12) while roughly tripling the fit time (about 880 s vs. 291 s per fit). To speed up the search, larger min_samples_split values will therefore be explored with only 250 estimators.
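The accuracy versus fit-time trade-off described above can be read directly from `cv_results_`. The following is a minimal sketch on synthetic data, not the real-estate dataset: the names `X_demo`, `y_demo`, `demo_grid`, and `summary` are illustrative assumptions, and the grid is deliberately tiny so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class stand-in data (assumption: NOT the real-estate listings)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=6)

# Small illustrative grid; return_train_score=True adds the train-score columns
demo_grid = GridSearchCV(RandomForestClassifier(random_state=6),
                         param_grid={'n_estimators': [10, 40],
                                     'min_samples_split': [2, 12]},
                         cv=3, scoring='accuracy', return_train_score=True)
demo_grid.fit(X_demo, y_demo)

# Keep only the columns needed to weigh accuracy against training cost
summary = pd.DataFrame(demo_grid.cv_results_)[
    ['param_n_estimators', 'param_min_samples_split',
     'mean_fit_time', 'mean_test_score', 'mean_train_score']].copy()

# A large train/test gap is a common sign of overfitting
summary['train_test_gap'] = summary['mean_train_score'] - summary['mean_test_score']
summary = summary.sort_values('mean_test_score', ascending=False)
print(summary)
```

Sorting by `mean_test_score` while keeping `mean_fit_time` alongside makes it easy to see when a much slower configuration buys only a marginal gain.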

In [27]:
# Use a GridSearchCV Object for random forest classification
clf_pipe = Pipeline([('Scaler', StandardScaler()),
                     ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                     ('CLF', RandomForestClassifier(max_depth = None, min_samples_leaf = 1, n_jobs = -1, random_state = 6, verbose = 1))])

params = dict(CLF__n_estimators = [250],
             CLF__criterion = ['gini'],
             CLF__min_samples_split = [16, 30, 50, 60])

grid = GridSearchCV(clf_pipe, param_grid = params, cv = 5, n_jobs = -1, verbose = 2, scoring = 'accuracy')
grid.fit(X, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.0s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.9s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250, total= 4.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.2min remaining:    0.0s
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.9s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.1s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250, total= 4.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.0s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.0s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250, total= 4.9min
[CV] CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.8s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.1s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250, total= 4.7min
[CV] CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.4s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    4.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    7.3s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=16, CLF__n_estimators=250, total= 4.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.2min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.3s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250, total= 4.8min
[CV] CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.2s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250, total= 4.6min
[CV] CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.2s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250, total= 4.6min
[CV] CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.1min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    2.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.5s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.4s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250, total= 4.6min
[CV] CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.0min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.4s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.6s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    6.2s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=30, CLF__n_estimators=250, total= 4.6min
[CV] CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    1.0s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.7s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.6s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.6s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.8s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.2s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.8s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=50, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.6s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.2s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.5s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.0min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.5s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.1s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.3s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250, total= 4.6min
[CV] CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.9s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.6s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.4s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250, total= 4.5min
[CV] CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.9min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.5s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250, total= 4.4min
[CV] CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250 


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.8min finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.8s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    1.3s finished
[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 250 out of 250 | elapsed:    5.7s finished


[CV]  CLF__criterion=gini, CLF__min_samples_split=60, CLF__n_estimators=250, total= 4.5min


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 100.2min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.8min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('Scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=6,
  svd_solver='full', tol=0.0, whiten=False)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            ...ators='warn', n_jobs=-1,
            oob_score=False, random_state=6, verbose=1, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=1,
       param_grid={'CLF__n_estimators': [250], 'CLF__criterion': ['gini'], 'CLF__min_samples_split': [16, 30, 50, 60]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=2)

In [28]:
print('The optimum parameters were: ', grid.best_params_, '\n')
print('The best estimator was: ', grid.best_estimator_, '\n')
print('With an accuracy of: ', grid.best_score_)

cvres = pd.DataFrame(grid.cv_results_)
cvres
# Export to CSV
cvres.to_csv('RandomForestClassification_Results 2.csv', index = False)

The optimum parameters were:  {'CLF__criterion': 'gini', 'CLF__min_samples_split': 60, 'CLF__n_estimators': 250} 

The best estimator was:  Pipeline(memory=None,
     steps=[('Scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=6,
  svd_solver='full', tol=0.0, whiten=False)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            ...timators=250, n_jobs=-1,
            oob_score=False, random_state=6, verbose=1, warm_start=False))]) 

With an accuracy of:  0.5822721754150646




The optimum Random Forest parameter was a min_samples_split of 60, the upper limit of our grid search. Let's try again with increased limits.
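Checking whether the best value sits on the edge of the grid can be automated. The helper below is a hypothetical sketch (`params_on_grid_edge` is not part of scikit-learn); it takes a `best_params_`-style dictionary and the parameter grid, and only considers numeric parameters with more than one candidate value.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return names of numeric parameters whose best value is the min or max of its grid."""
    on_edge = []
    for name, values in param_grid.items():
        # Skip single-valued grids and non-numeric options such as 'gini'
        if len(values) < 2 or not all(isinstance(v, (int, float)) for v in values):
            continue
        best = best_params[name]
        if best == min(values) or best == max(values):
            on_edge.append(name)
    return on_edge


# Grid and best parameters copied from the search reported above
demo_params = dict(CLF__n_estimators=[250],
                   CLF__criterion=['gini'],
                   CLF__min_samples_split=[16, 30, 50, 60])
demo_best = {'CLF__criterion': 'gini', 'CLF__min_samples_split': 60, 'CLF__n_estimators': 250}

print(params_on_grid_edge(demo_best, demo_params))
```

Any parameter returned by the helper is a candidate for a wider range in the next search.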

<font color = 'blue'>I don't think the cell below ran all 5 folds, so I have to rerun it.</font>

In [11]:
# Use a GridSearchCV Object for random forest classification
clf_pipe = Pipeline([('Scaler', StandardScaler()),
                     ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                     ('CLF', RandomForestClassifier(n_jobs = -1, random_state = 6, verbose = 1))])

params = dict(CLF__n_estimators = [250],
             CLF__criterion = ['gini'],
             CLF__min_samples_split = [60, 72, 84, 96, 108])

grid = GridSearchCV(clf_pipe, param_grid = params, cv = 5, n_jobs = -1, verbose = 2, scoring = 'accuracy')
grid.fit(X, y)

print('The best estimator was: ', grid.best_estimator_, '\n')
print('The optimum parameters were: ', grid.best_params_, '\n')
print('With an accuracy of: ', grid.best_score_)

cvres = pd.DataFrame(grid.cv_results_)
cvres
# Export to CSV
cvres.to_csv('RandomForestClassification_Results 3.csv', index = False)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 out of  25 | elapsed: 25.8min remaining: 20.3min
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed: 25.9min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   51.1s


The best estimator was:  Pipeline(memory=None,
     steps=[('Scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('PCA', PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=6,
  svd_solver='full', tol=0.0, whiten=False)), ('CLF', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            ...timators=250, n_jobs=-1,
            oob_score=False, random_state=6, verbose=1, warm_start=False))]) 

The optimum parameters were:  {'CLF__criterion': 'gini', 'CLF__min_samples_split': 108, 'CLF__n_estimators': 250} 

With an accuracy of:  0.5655272651992677


[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.4min finished


## Random Forest Regression
We will create a new dataset which excludes price classification and interaction terms, and uses the log price as the target variable.

In [21]:
# Create a dataset for use in regression to predict the log price
X = trans_nol3.drop(columns = ['price_class', 'log_price']).values
y = trans_nol3.log_price.values

In [None]:
# NEED NEW DATASET!!!!!
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

# Use a GridSearchCV Object 
reg_pipe = Pipeline([('Scaler', StandardScaler()),
                    ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                    ('REG', RandomForestRegressor(random_state = 6, n_jobs = -1, verbose = 1))])

params = dict(REG__n_estimators = [250, 500],
             REG__min_samples_split = [20, 60, 100],
             REG__min_samples_leaf = [1, 3, 5])

grid = GridSearchCV(reg_pipe, param_grid = params, cv = 5, n_jobs = -1, verbose = 2, scoring = 'neg_mean_squared_error')
grid.fit(X,y)

cvres = pd.DataFrame(grid.cv_results_)
cvres
# Export to CSV
cvres.to_csv('RandomForestRegression_Results 1.csv', index = False)

print('The best estimator is: ', grid.best_estimator_, '\n')
print('The optimum parameters were: ', grid.best_params_, '\n')
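# scoring = 'neg_mean_squared_error' stores the negated MSE, so flip the sign to report the MSE itself
print('With an MSE of: ', -grid.best_score_)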

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.


Next, consider the AdaBoost algorithm as an alternative to random forest.

In [21]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train CV ADABOOST Decision tree models
X = trans_nol3.drop(columns = ['price_class']).values
y = trans_nol3.price_class.values
it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    clf_pipe = Pipeline([('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                ('Scaler', StandardScaler()),
                ('ADA', AdaBoostClassifier(n_estimators = 1000, random_state = 6))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    # Note: the printed label reads 'Random Forest', but this pipeline fits an AdaBoost classifier
    print('Random Forest Accuracy: ', total_acc)
    it += 1
    print('############################')

######## CV SPLIT  0  #######
Wall time: 11min 3s
Random Forest Accuracy:  0.4751711164498436
############################
######## CV SPLIT  1  #######
Wall time: 11min 16s
Random Forest Accuracy:  0.4511581523956303
############################
######## CV SPLIT  2  #######
Wall time: 11min 17s
Random Forest Accuracy:  0.4625636027786907
############################
######## CV SPLIT  3  #######
Wall time: 11min 5s
Random Forest Accuracy:  0.42617544734426527
############################
######## CV SPLIT  4  #######
Wall time: 11min 17s
Random Forest Accuracy:  0.3732462999478706
############################


In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Use the interaction term dataset 
X_pre = trans_nol3.drop(columns = 'price_class').values
X = np.append(X_pre, int_vectors, axis = 1)
y = trans_nol3.price_class.values

it = 0

# Train CV Random Forest Model
for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    clf_pipe = Pipeline([('PCA', PCA(n_components = .9999, svd_solver = 'full', random_state = 6)),
                ('Scaler', StandardScaler()),
                ('CLF', RandomForestClassifier(n_estimators = 1000, criterion = 'gini', max_depth = 100, 
                                              min_samples_split = 8, min_samples_leaf = 1, n_jobs = -1, random_state = 6, 
                                              verbose = 1))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    print('Random Forest Accuracy: ', total_acc)
    it += 1
    print('############################')

######## CV SPLIT  0  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   56.4s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


Wall time: 2min 21s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    3.3s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    4.4s finished


Random Forest Accuracy:  0.5799034495263135
############################
######## CV SPLIT  1  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   22.0s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   56.5s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


Wall time: 2min 23s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.6s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    3.2s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    4.3s finished


Random Forest Accuracy:  0.5568650559811432
############################
######## CV SPLIT  2  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   57.0s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


Wall time: 2min 23s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    2.8s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    3.8s finished


Random Forest Accuracy:  0.5631041555704135
############################
######## CV SPLIT  3  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   56.7s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


Wall time: 2min 23s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    2.5s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    3.7s finished


Random Forest Accuracy:  0.5329714538263658
############################
######## CV SPLIT  4  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   56.9s
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  2.3min finished


Wall time: 2min 22s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    1.5s


Random Forest Accuracy:  0.45388817116565805
############################


[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    1.9s finished


In [17]:
# Use the interaction term dataset 
X_pre = trans_nol3.drop(columns = 'price_class').values
X = np.append(X_pre, int_vectors, axis = 1)
y = trans_nol3.price_class.values

it = 0

# Train CV Random Forest Model
for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    clf_pipe = Pipeline([('PCA', PCA(n_components = .9999, svd_solver = 'full', random_state = 6)),
                ('Scaler', StandardScaler()),
                ('CLF', RandomForestClassifier(n_estimators = 1000, criterion = 'entropy', max_depth = 100, 
                                              min_samples_split = 9, min_samples_leaf = 1, n_jobs = -1, random_state = 6, 
                                              verbose = 1))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    print('Random Forest Accuracy: ', total_acc)
    it += 1
    print('############################')

######## CV SPLIT  0  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   37.8s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.0min finished


Wall time: 4min 6s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    3.0s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    4.0s finished


Random Forest Accuracy:  0.5825551878881283
############################
######## CV SPLIT  1  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.1min finished


Wall time: 4min 7s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.4s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    2.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    3.8s finished


Random Forest Accuracy:  0.5590295090884366
############################
######## CV SPLIT  2  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   38.9s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.1min finished


Wall time: 4min 9s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.5s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.5s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    2.9s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    3.9s finished


Random Forest Accuracy:  0.5653592919551692
############################
######## CV SPLIT  3  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   38.1s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.0min finished


Wall time: 4min 3s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    2.4s
[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    3.2s finished


Random Forest Accuracy:  0.5351245991183436
############################
######## CV SPLIT  4  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 736 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  4.0min finished


Wall time: 4min 1s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.2s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    0.7s
[Parallel(n_jobs=32)]: Done 736 tasks      | elapsed:    1.4s


Random Forest Accuracy:  0.4546587792661091
############################


[Parallel(n_jobs=32)]: Done 1000 out of 1000 | elapsed:    1.9s finished


In [19]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train CV ADABOOST Decision tree models
# Use the interaction term dataset 
X_pre = trans_nol3.drop(columns = 'price_class').values
X = np.append(X_pre, int_vectors, axis = 1)
y = trans_nol3.price_class.values

it = 0

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    clf_pipe = Pipeline([('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                ('Scaler', StandardScaler()),
                ('ADA', AdaBoostClassifier(n_estimators = 1000, random_state = 6))])
        
    %time clf_pipe.fit(X[train],y[train])
    yhat = clf_pipe.predict(X[test])
    total_acc = mt.accuracy_score(y[test],yhat)
    # Note: the printed label reads 'Random Forest', but this pipeline fits an AdaBoost classifier
    print('Random Forest Accuracy: ', total_acc)
    it += 1
    print('############################')

######## CV SPLIT  0  #######
Wall time: 15min 55s
Random Forest Accuracy:  0.5983749603372467
############################
######## CV SPLIT  1  #######
Wall time: 15min 15s
Random Forest Accuracy:  0.5226304337971986
############################
######## CV SPLIT  2  #######
Wall time: 15min 9s
Random Forest Accuracy:  0.5391135840803236
############################
######## CV SPLIT  3  #######
Wall time: 15min 44s
Random Forest Accuracy:  0.5132871729202316
############################
######## CV SPLIT  4  #######
Wall time: 16min 47s
Random Forest Accuracy:  0.4673058180911584
############################
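
The per-fold accuracies printed by the four cross-validation loops above can be condensed into a mean and standard deviation per model. The sketch below copies the fold scores from the outputs by hand; the dictionary keys are descriptive labels chosen here, not names from the notebook. Because the price classes are tertiles and therefore balanced, uniform random guessing would be expected to score about 1/3, which is used as a rough reference point.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cell outputs above (labels are illustrative)
fold_acc = {
    'AdaBoost, no interactions': [0.4751711164498436, 0.4511581523956303,
                                  0.4625636027786907, 0.42617544734426527,
                                  0.3732462999478706],
    'RF gini + interactions': [0.5799034495263135, 0.5568650559811432,
                               0.5631041555704135, 0.5329714538263658,
                               0.45388817116565805],
    'RF entropy + interactions': [0.5825551878881283, 0.5590295090884366,
                                  0.5653592919551692, 0.5351245991183436,
                                  0.4546587792661091],
    'AdaBoost + interactions': [0.5983749603372467, 0.5226304337971986,
                                0.5391135840803236, 0.5132871729202316,
                                0.4673058180911584],
}

chance = 1 / 3  # balanced three-class problem (price tertiles)

# Mean and sample standard deviation of accuracy across the five folds
summary = {name: (mean(acc), stdev(acc)) for name, acc in fold_acc.items()}
for name, (m, s) in summary.items():
    print(f'{name:28s} mean = {m:.4f}  sd = {s:.4f}  lift over chance = {m - chance:+.4f}')

best = max(summary, key=lambda name: summary[name][0])
print('Highest mean accuracy:', best)
```

In every run the last split scores noticeably lower than the other four, so the standard deviation is worth reporting alongside the mean.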


## Random Forest Regression
Using the reduced dataset excluding the price classification and L3 variables, we will use random forest regression to predict a property's log price.

In [20]:
from sklearn.ensemble import RandomForestRegressor
#regressor = RandomForestRegressor(n_estimators = 1000, random_state = 6, max_depth = None, min_samples_split = 8, n_jobs = -1,
#                                 verbose = 2)

In [26]:
# First we will use the reduced (no l3) dataset for our modeling.
from sklearn.model_selection import ShuffleSplit

# Use the dataset with no L3 to reduce computation time
X = trans_nol3.drop(columns = ['price_class', 'log_price']).values
y = trans_nol3.log_price.values

cv_obj = ShuffleSplit(n_splits=5, test_size  = 0.2, random_state = 6)

In [29]:
# Fit the regressor
it = 0
yhat = np.zeros(y.shape)

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    reg_pipe = Pipeline([('Scaler', StandardScaler()),
                         ('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                         ('REG', RandomForestRegressor(n_estimators = 500, random_state = 6, max_depth = None, 
                                                       min_samples_split = 8, n_jobs = -1, verbose = 1))])
        
    %time reg_pipe.fit(X[train],y[train])
    yhat[test] = reg_pipe.predict(X[test])
    mse = mt.mean_squared_error(y[test], yhat[test])
    print('Random Forest Regression MSE: ', mse)
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.2s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   23.8s finished


Wall time: 27.3 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished


Random Forest Regression MSE:  0.3135890770406205
#############################

######## CV SPLIT  1  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.0s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   23.7s finished


Wall time: 27.1 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished


Random Forest Regression MSE:  0.3209066699100948
#############################

######## CV SPLIT  2  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.6s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   24.2s finished


Wall time: 27.7 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished


Random Forest Regression MSE:  0.31595640367746325
#############################

######## CV SPLIT  3  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   18.9s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   23.6s finished


Wall time: 27.1 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished


Random Forest Regression MSE:  0.31703652409934213
#############################

######## CV SPLIT  4  #######


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.8s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.8s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   24.3s finished


Wall time: 27.8 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.2s


Random Forest Regression MSE:  0.31923035846918235
#############################



[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished
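
Because the target is the log price, an MSE of roughly 0.32 is easier to interpret after converting it to an RMSE and then to a multiplicative error on the price itself. The sketch below does this for the five fold MSEs printed above. It assumes `log_price` is a natural logarithm, which this section does not state; if another base was used, the `math.exp` step would need to change accordingly.

```python
import math
from statistics import mean

# Fold MSEs copied from the random forest regression output above
fold_mse = [0.3135890770406205, 0.3209066699100948, 0.31595640367746325,
            0.31703652409934213, 0.31923035846918235]

mean_mse = mean(fold_mse)
rmse_log = math.sqrt(mean_mse)   # typical error, in log price units
factor = math.exp(rmse_log)      # assumption: log_price is a natural log

print(f'Mean MSE (log price):  {mean_mse:.4f}')
print(f'RMSE (log price):      {rmse_log:.4f}')
print(f'Multiplicative factor: {factor:.2f}')
```

Under that assumption, the factor can be read as the typical ratio between a predicted and an actual price, which is a more tangible quantity than a squared error on the log scale.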


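Run again, with the aim of reporting the MSE on the log price scale. Note that the pipeline only transforms the features, not the target, so its predictions are already in log price units.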

In [31]:
# Fit the regressor
it = 0
yhat = np.zeros(y.shape)

for train, test in cv_obj.split(X,y):
    print('######## CV SPLIT ', it, ' #######')
    reg_pipe = Pipeline([('PCA', PCA(n_components = .99, svd_solver = 'full', random_state = 6)),
                ('Scaler', StandardScaler()),
                ('REG', RandomForestRegressor(n_estimators = 500, random_state = 6, max_depth = None, min_samples_split = 8, 
                                             n_jobs = -1, verbose = 1))])
        
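    # A Pipeline only transforms X; y passes through unchanged, so predictions are already
    # on the log price scale and no inverse transform is needed.
    # (Calling fit without y, or transform on a pipeline whose last step is a regressor,
    # fails with the TypeError / AttributeError recorded in the output below.)
    %time reg_pipe.fit(X[train],y[train])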
    yhat[test] = reg_pipe.predict(X[test])
    mse = mt.mean_squared_error(y[test], yhat[test])
    print('Random Forest Regression MSE: ', mse)
    it += 1
    print('#############################\n')

######## CV SPLIT  0  #######


TypeError: Singleton array array(None, dtype=object) cannot be considered a valid collection.

AttributeError: 'RandomForestRegressor' object has no attribute 'transform'

AttributeError: 'RandomForestRegressor' object has no attribute 'transform'

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:   23.7s finished


Wall time: 27.1 s


[Parallel(n_jobs=32)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=32)]: Done 136 tasks      | elapsed:    0.4s
[Parallel(n_jobs=32)]: Done 386 tasks      | elapsed:    1.1s
[Parallel(n_jobs=32)]: Done 500 out of 500 | elapsed:    1.5s finished


Random Forest Regression MSE:  0.3135890770406205
#############################

######## CV SPLIT  1  #######


TypeError: Singleton array array(None, dtype=object) cannot be considered a valid collection.

AttributeError: 'RandomForestRegressor' object has no attribute 'transform'

AttributeError: 'RandomForestRegressor' object has no attribute 'transform'

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 136 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]: Done 386 tasks      | elapsed:   19.6s


KeyboardInterrupt: 

NotFittedError: Estimator not fitted, call `fit` before exploiting the model.

In [35]:
yhat[test]

array([ 0.        ,  0.        ,  0.        , ...,  0.        ,
       12.07724221,  0.        ])
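
The zeros in `yhat[test]` are consistent with how the loop above ended: `yhat` is reset to zeros at the top of the cell, only split 0 finished and wrote its predictions, and `test` still holds the indices of the interrupted split 1. Only rows that happen to fall in both test sets carry a prediction. The sketch below reproduces that pattern with the standard library; `random.sample` stands in for `ShuffleSplit`, and the row count and placeholder prediction value are made up for illustration.

```python
import random

n_rows = 1000            # hypothetical dataset size
test_size = 200          # 20% test split, mirroring test_size = 0.2
rng = random.Random(6)

yhat = [0.0] * n_rows    # predictions start at zero, as with np.zeros(y.shape)

# Split 0 completes: its test rows receive a (placeholder) prediction
test_0 = rng.sample(range(n_rows), test_size)
for i in test_0:
    yhat[i] = 12.0

# Split 1 is drawn independently but never reaches predict()
test_1 = rng.sample(range(n_rows), test_size)

overlap = set(test_0) & set(test_1)
n_zero = sum(1 for i in test_1 if yhat[i] == 0.0)
print(f'Rows of split 1 already predicted by split 0: {len(overlap)}')
print(f'Rows of split 1 still at zero:                {n_zero}')
```

Since repeated random splits do not partition the data, a `yhat` array filled this way should not be expected to contain one out-of-sample prediction for every row, even when all splits complete.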