In [14]:
import pandas as pd
import numpy as np
import os
import sklearn
from operator import mul
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

## Predicting the distillation profile of a mixture of two liquids

The general approach: given the distillation profiles of the two liquids, a Random Forest Regressor predicts each liquid's density and its composition by mass% of the different compounds. A multi-output model is used because the mass fractions of the compounds are not independent of one another.

From the predicted mass% and density of both liquids, together with their volumes, the composition by mass% of the mixture is calculated.

The mass% composition of the mixture is then used to predict its distillation profile. Here a separate model is fitted for each output variable (each percentage in the distillation profile), treating the outputs as independent of each other.
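
As a minimal sketch of the first step, scikit-learn's RandomForestRegressor accepts a two-dimensional target, so one model can predict the density and all the mass% columns together. The data below is synthetic and only stands in for the real files; the shapes (9 distillation temperatures in, 7 properties out) mirror the lists used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 9 distillation temperatures in,
# 7 properties (density plus six mass% columns) out.
rng = np.random.default_rng(0)
X = rng.uniform(30.0, 700.0, size=(50, 9))
y = rng.uniform(0.0, 10.0, size=(50, 7))

# A single forest is fitted on all seven targets at once.
model = RandomForestRegressor(n_estimators=5, max_depth=10, random_state=0)
model.fit(X, y)

pred = model.predict(X[:1])
print(pred.shape)  # one row, one column per target
```

Fixing random_state here is a choice made for the sketch so that repeated runs give the same forest.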

This is the main function, which calls the helper functions defined below.

In [15]:
def distillation_profile_function(dist_prof1, vol1, dist_prof2, vol2):
    '''
    Predicting the distillation profile of the mixture of two liquids
    
    Parameters:
    -----------------------------
        dist_prof1: dict, distillation profile for liquid-1
        vol1: int, volume of liquid-1 in the mixture
        dist_prof2: dict, distillation profile for liquid-2
        vol2: int, volume of liquid-2 in the mixture
        
    Returns:
    -----------------------------
        a dictionary with the distillation profile of the mixture
    '''
    
    # Defining lists to be used frequently
    properties = ['Density', 'Butanes', 'Heptanes', 'Hexanes', 'Octanes', 'Nonanes', 'Decanes']
    profile = ['5%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']
    
    # Call function to read data
    data = read_data(properties, profile)
    
    # Call function to predict the properties of the liquid
    liquid_1, liquid_2 = pred_model(data, dist_prof1, dist_prof2, properties, profile)
    
    # Calculate the final composition of the mixture
    final_mixture_comp = pred_mixture_comp(liquid_1[0], liquid_2[0], vol1, vol2)
    
    # Predict the final distillation profile of the mixture
    dist_prof_final = pred_dist_prof(data, final_mixture_comp, properties, profile)
    
    dist_prof_final = np.array(dist_prof_final).flatten()
    res = {profile[i]: dist_prof_final[i] for i in range(len(dist_prof_final))}
    
    return res

The following function reads every file in the data folder, keeps only the specified columns, renames them, and returns the resulting data frame to the caller. I chose to drop rows with null values, because filling a missing value with the column mean or median would not be accurate.

Under the assumption that liquids with similar composition by mass have similar distillation profiles, a collaborative (recommendation-system style) approach could be tried to predict the missing values instead.
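
A simple stand-in for that idea is nearest-neighbour imputation, sketched below with scikit-learn's KNNImputer. This is not what the notebook does (it drops the rows), and it is not a full collaborative-filtering model; the toy table and the choice of n_neighbors=1 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy table (assumption): two composition columns and one distillation cut,
# with one missing temperature in the second row.
df = pd.DataFrame({
    "Butanes": [3.0, 3.1, 6.0, 6.2],
    "Heptanes": [5.0, 5.2, 2.0, 2.1],
    "50%": [300.0, np.nan, 450.0, 455.0],
})

# Each missing value is filled with the mean of that column over the
# n_neighbors rows that are closest in the observed columns.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

In this toy case the second row is closest to the first one, so it borrows that row's temperature. On the real data the columns have very different scales, so scaling them before computing distances would probably be needed.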

In [16]:
def read_data(properties, profile):
    '''
    Reading data from multiple files in the data folder of the repo
    
    Parameters:
    -----------------------------
        properties: list, list of the properties of the liquid
        profile: list, list of percentages in the distillation profile
    
    Returns:
    -----------------------------
        a dataframe with the data from all files
    '''
    
    # Column names in the original dataframe
    required_columns = ['Density (kg/m^3)', 'Butanes (mass%)', 'Heptanes (mass%)', 'Hexanes (mass%)', 'Octanes (mass%)','Nonanes (mass%)', 'Decanes (mass%)','5% Off (deg. C)', '10% Off (deg. C)', '20% Off (deg. C)', '30% Off (deg. C)', '40% Off (deg. C)', '50% Off (deg. C)', '60% Off (deg. C)', '70% Off (deg. C)', '80% Off (deg. C)']
    
    # Collecting the required columns from every file in the data directory
    frames = []
    for filename in os.listdir("data"):
        if filename.endswith(".CSV"):
            data = pd.read_csv(f"data/{filename}")
            frames.append(data[required_columns])
    
    # Concatenate once (DataFrame.append is not available in pandas >= 2.0)
    final_data = pd.concat(frames, ignore_index = True)
    
    # Drop na from the dataset
    final_data = final_data.dropna()
    
    # Check if data type is object and remove any '-' that appears
    for column in required_columns:
        if final_data[column].dtype == object:
            final_data.drop(final_data.index[final_data[column] == '-'], inplace = True)
            final_data[column] = final_data[column].astype(float)
    print(final_data.shape)
    
    # Renaming the columns in the dataframe
    final_data.columns = [*properties, *profile]
    
    return final_data

The pred_model function takes the distillation profiles as features and predicts the composition by mass% of the liquid.

Hyper-parameter optimization for this model is towards the end of the notebook.

In [17]:
def pred_model(data, dist_prof1, dist_prof2, properties, profile):
    '''
    Predicting the mass% of different compounds in the liquids given the distillation profile
    
    Parameters:
    -----------------------------
        data: dataframe, the input data with the features and target variables for training the model
        dist_prof1: dict, distillation profile for liquid-1
        dist_prof2: dict, distillation profile for liquid-2
        properties: list, list of the properties of the liquid
        profile: list, list of percentages in the distillation profile
        
    Returns:
    -----------------------------
        two arrays, each with the predicted density followed by the composition by mass% of one liquid
    '''
    
    X, y = data[[*profile]], data[[*properties]]
    
    final_model = RandomForestRegressor(max_depth = 10, n_estimators = 5)
    final_model.fit(X, y)
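    # Read the values in the order of `profile` so they line up with the training columns
    liq1 = pd.DataFrame([[dist_prof1[p] for p in profile]], columns = profile)
    liq2 = pd.DataFrame([[dist_prof2[p] for p in profile]], columns = profile)
    liquid_1 = final_model.predict(liq1)
    liquid_2 = final_model.predict(liq2)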
    
    return liquid_1, liquid_2

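In the pred_mixture_comp function, I calculate the mass of each liquid from the relation mass = density \* volume, and then calculate the mass% of each compound in the mixture as ((mass of liquid-1) \* (mass% of the compound in liquid-1) + (mass of liquid-2) \* (mass% of the compound in liquid-2)) / (mass of liquid-1 + mass of liquid-2). The density of the mixture is the total mass divided by the total volume, which assumes that the two volumes are additive.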
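
To make the arithmetic concrete, here is a small worked sketch in plain Python. The function name `mix` and all the input numbers are hypothetical; only the formulas come from the description above.

```python
def mix(density_1, comp_1, vol_1, density_2, comp_2, vol_2):
    """Mass-weighted blend of two liquids; assumes volumes add on mixing."""
    mass_1 = density_1 * vol_1
    mass_2 = density_2 * vol_2
    total_mass = mass_1 + mass_2
    # Mass% of each compound is the mass-weighted average of the two liquids
    comp = [(mass_1 * a + mass_2 * b) / total_mass
            for a, b in zip(comp_1, comp_2)]
    density = total_mass / (vol_1 + vol_2)
    return density, comp

# Hypothetical inputs: densities in kg/m^3, compositions in mass%.
density, comp = mix(800.0, [4.0, 6.0], 80, 900.0, [2.0, 10.0], 50)
print(density, comp)
```

With these inputs the masses are 64000 and 45000, so the mixture density is 109000 / 130 (about 838.5) and the two mass% values come out at about 3.17 and 7.65. Each lies between the values of the two liquids, as a weighted average should.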

In [18]:
def pred_mixture_comp(liquid_1, liquid_2, vol1, vol2):
    '''
    Find the composition by mass% in the final mixture
    
    Parameters:
    -----------------------------
        liquid_1: array, density followed by the composition by mass% of liquid-1
        liquid_2: array, density followed by the composition by mass% of liquid-2
        vol1: int, volume of liquid-1 in the mixture
        vol2: int, volume of liquid-2 in the mixture
        
    Returns:
    -----------------------------
        an array with the density followed by the composition by mass% of the mixture
    '''
    # Calculate mass of liquid
    mass_1, mass_2 = liquid_1[0] * vol1, liquid_2[0] * vol2
    
    # Calculate final mixture composition
    final_mixture_comp = ((mass_1 * liquid_1[1:]) + (mass_2 * liquid_2[1:]))/ (mass_1 + mass_2)
    
    # Calculate density of the mixture
    density_mixture = (mass_1 + mass_2)/(vol1 + vol2)
    
    # Insert density at the first position in the array
    final_mixture_comp = np.insert(final_mixture_comp, 0, density_mixture)
    
    return final_mixture_comp

In the pred_dist_prof function, using the composition of the mixture, I try to predict the distillation profile of the mixture by using a separate model for each of the percentages required.
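
As a side note on the one-model-per-percentage design: scikit-learn's Ridge also accepts a two-dimensional target, and it fits each target column independently with the same penalty. So a single joint fit should give the same predictions as the loop. The sketch below checks this on synthetic data (an assumption; the shapes mirror 7 properties in, 9 temperatures out).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption): 7 properties in, 9 temperatures out.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 7))
Y = rng.normal(size=(40, 9))
x_new = rng.normal(size=(1, 7))

# One Ridge model per output column, as in the loop-based approach.
per_target = np.column_stack(
    [Ridge().fit(X, Y[:, j]).predict(x_new) for j in range(Y.shape[1])]
)

# A single Ridge fitted on all columns at once.
joint = Ridge().fit(X, Y).predict(x_new)

# The two are expected to agree up to floating-point error.
print(np.allclose(per_target, joint))
```

This equivalence is specific to models like Ridge; for a tree-based model, a joint fit and separate fits are generally different models.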

In [19]:
def pred_dist_prof(data, final_mixture_comp, properties, profile):
    '''
    Predict the distillation profile given the composition by mass% of a liquid
    
    Parameters:
    -----------------------------
        data: dataframe, the input data with the features and target variables for training the model
        final_mixture_comp: array, composition by mass of the liquid
        properties: list, list of the properties of the liquid
        profile: list, list of percentages in the distillation profile
        
    Returns:
    -----------------------------
        a list with one predicted temperature array per percentage in the distillation profile
    '''
    
    y, X = data[[*profile]], data[[*properties]]
    preds = []
    for i in [*profile]:
        y_hat = data[[i]]
        final_model = Ridge()
        final_model.fit(X, y_hat)
        preds.append(final_model.predict([final_mixture_comp]))
        
    return preds

In [20]:
dist_prof1 = {'5%': 38.1, '10%': 81.8, '20%': 203.5, '30%': 320.1, '40%': 394.6, '50%': 461.5, '60%': 538.5, '70%': 627.2, '80%': 713.1}

In [21]:
dist_prof2 = {'5%': 41.2, '10%': 101.6, '20%': 251.5, '30%': 334.2, '40%': 404.9, '50%': 467.7, '60%': 537.7, '70%': 615.9, '80%': 714}

In [22]:
distillation_profile_function(dist_prof1, 80, dist_prof2, 50)

(293, 16)


{'5%': 38.55498610197429,
 '10%': 81.64136101115216,
 '20%': 229.022747375687,
 '30%': 331.4715685769024,
 '40%': 401.60818035300986,
 '50%': 465.0672349775625,
 '60%': 535.3246893525913,
 '70%': 613.0417418369934,
 '80%': 692.2513397038056}
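
Because every percentage is predicted by its own model, nothing forces the predicted temperatures to increase along the profile. A quick sanity check, assuming a valid distillation profile never drops in temperature as more of the liquid is distilled, could look like the sketch below. The helper name is hypothetical and the sample values are rounded from the output above.

```python
def is_non_decreasing(profile):
    """Return True if temperatures never drop as the distilled fraction grows."""
    # Sort the keys by their numeric percentage, e.g. '5%' -> 5.0
    cuts = sorted(profile, key=lambda k: float(k.rstrip('%')))
    temps = [profile[k] for k in cuts]
    return all(a <= b for a, b in zip(temps, temps[1:]))

# Values rounded from the prediction above (first four percentages only).
predicted = {'5%': 38.55, '10%': 81.64, '20%': 229.02, '30%': 331.47}
print(is_non_decreasing(predicted))
```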

## End of Functions

### Data exploration done before writing the functions (scroll down for the hyper-parameter optimization process)

In [36]:
X, y = final_data[['5%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']], final_data[['Density', 'Butanes', 'Heptanes', 'Hexanes', 'Octanes', 'Nonanes', 'Decanes']]

In [213]:
final_data.drop(final_data.index[final_data['Decanes'] == '-'], inplace = True)
final_data['Decanes'] = final_data['Decanes'].astype(float)

In [5]:
model = LinearRegression()

In [6]:
model.fit(X, y)

LinearRegression()

In [7]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

In [8]:
n_scores = cross_validate(model, X, y, cv=cv, n_jobs=-1, return_train_score = True)

In [37]:
param_grid_dtree = {"decisiontreeregressor__max_depth": [5, 10, 15, 20]}
param_grid_rf = {"randomforestregressor__n_estimators" : [4, 5, 8, 15, 30],
                "randomforestregressor__max_depth" : [5, 8, 10, 13]}
param_ridge = {}

In [38]:
pipe_rf = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=2))
pipe_tree = make_pipeline(StandardScaler(), DecisionTreeRegressor())
pipe_ridge = make_pipeline(StandardScaler(), Ridge())

regressors = {
    "ridge": [pipe_ridge, param_ridge],
    "random_forest": [pipe_rf, param_grid_rf],
    "DTree": [pipe_tree, param_grid_dtree]
}

In [39]:
for name, model in regressors.items():  
    grid_search = GridSearchCV(model[0], model[1], cv=5, n_jobs=-1, return_train_score=True, refit = 'neg_mean_squared_error', scoring = ['neg_mean_squared_error'])
    grid_search.fit(X, y)
    print(f"Best score from {name} grid search: %.3f" % grid_search.best_score_)
    print(f"Best set of params from {name}", grid_search.best_params_)
    metrics = pd.DataFrame(grid_search.cv_results_)
    display(metrics[['mean_fit_time', 'mean_score_time', 'params', 'mean_train_neg_mean_squared_error', 'mean_test_neg_mean_squared_error']])


Best score from ridge grid search: -22.849
Best set of params from ridge {}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.635619 | 0.006716 | {} | -11.748727 | -22.848682 |


Best score from random_forest grid search: -8.692
Best set of params from random_forest {'randomforestregressor__max_depth': 10, 'randomforestregressor__n_estimators': 5}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.041317 | 0.011908 | {'randomforestregressor__max_depth': 5, 'rando... | -2.638689 | -11.049551 |
| 1 | 0.048654 | 0.026922 | {'randomforestregressor__max_depth': 5, 'rando... | -2.265859 | -9.096219 |
| 2 | 0.109707 | 0.034542 | {'randomforestregressor__max_depth': 5, 'rando... | -1.98318 | -10.418235 |
| 3 | 0.143291 | 0.01217 | {'randomforestregressor__max_depth': 5, 'rando... | -1.856992 | -10.061001 |
| 4 | 0.130364 | 0.014627 | {'randomforestregressor__max_depth': 5, 'rando... | -1.640248 | -9.469581 |
| 5 | 0.033067 | 0.008626 | {'randomforestregressor__max_depth': 8, 'rando... | -1.968588 | -10.607329 |
| 6 | 0.034962 | 0.009374 | {'randomforestregressor__max_depth': 8, 'rando... | -1.543928 | -8.764359 |
| 7 | 0.053485 | 0.008924 | {'randomforestregressor__max_depth': 8, 'rando... | -1.217961 | -10.158635 |
| 8 | 0.082385 | 0.012098 | {'randomforestregressor__max_depth': 8, 'rando... | -1.057382 | -9.858074 |
| 9 | 0.167346 | 0.015104 | {'randomforestregressor__max_depth': 8, 'rando... | -0.815616 | -9.53707 |


Best score from DTree grid search: -8.364
Best set of params from DTree {'decisiontreeregressor__max_depth': 10}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.024863 | 0.011431 | {'decisiontreeregressor__max_depth': 5} | -1.928085 | -30.692306 |
| 1 | 0.020801 | 0.006963 | {'decisiontreeregressor__max_depth': 10} | -0.249789 | -8.364025 |
| 2 | 0.016164 | 0.00984 | {'decisiontreeregressor__max_depth': 15} | -0.007885 | -20.936551 |
| 3 | 0.0146 | 0.006868 | {'decisiontreeregressor__max_depth': 20} | 0.0 | -8.640407 |


In [40]:
final_model = RandomForestRegressor(max_depth = 10, n_estimators = 5)

In [41]:
final_model.fit(X, y)

RandomForestRegressor(max_depth=10, n_estimators=5)

In [42]:
liquid_1 = final_model.predict([[56.6, 101.0, 144.5, 188.0, 240.6, 298.8, 356.2, 422.4, 512.4]])
liquid_1

array([[814.76 ,   3.132,   6.118,   5.072,   6.466,   5.452]])

In [43]:
liquid_2 = final_model.predict([[42.2, 83.6, 115.5, 160.1, 202.7, 257.6, 313.7, 378.5, 453.2]])

In [44]:
yhat2 = (80*liquid_1 + 80*liquid_2)/160
yhat2

array([[809.99 ,   3.148,   6.43 ,   5.383,   6.759,   5.526]])

In [193]:
model2 = LinearRegression()

In [194]:
model2.fit(y, X)

LinearRegression()

In [155]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

In [156]:
n_scores = cross_validate(model2, y, X, cv=cv, n_jobs=-1)

In [157]:
n_scores

{'fit_time': array([0.00897002, 0.00897026, 0.01380682, 0.01031113, 0.00981903,
        0.00809288, 0.00949216, 0.01027465, 0.0108161 , 0.00806618,
        0.00807571, 0.01051474, 0.00673294, 0.0079999 , 0.01540589,
        0.00790811, 0.00929785, 0.00803494, 0.00953698, 0.00862503,
        0.00789404, 0.00887775, 0.01219821, 0.00808406, 0.00798011,
        0.00795507, 0.00900722, 0.00796199, 0.00799298, 0.00792408]),
 'score_time': array([0.00650787, 0.00677776, 0.00669789, 0.00689173, 0.0063231 ,
        0.00663304, 0.00665689, 0.0070641 , 0.0126121 , 0.00642681,
        0.00661397, 0.00657415, 0.00634885, 0.00665712, 0.00685811,
        0.00632811, 0.00876641, 0.00658393, 0.00632024, 0.00674891,
        0.00888491, 0.00715899, 0.0064888 , 0.00641608, 0.00698185,
        0.00722504, 0.00627398, 0.00628495, 0.00585604, 0.004498  ]),
 'test_score': array([0.85819986, 0.81201396, 0.7945593 , 0.8590999 , 0.84516406,
        0.6761601 , 0.88042296, 0.82840952, 0.88388099, 0.89814187,
    

In [163]:
model2.predict(yhat2)

array([[ 45.07427094,  81.27973579, 123.28345484, 174.56318879,
        226.55160862, 282.44018151, 339.80504647, 404.53651119,
        480.60681055]])

In [45]:
y, X = final_data[['5%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']], final_data[['Density', 'Butanes', 'Heptanes', 'Hexanes', 'Octanes', 'Nonanes']]
    
param_grid_dtree = {"decisiontreeregressor__max_depth": [5, 10, 15, 20]}
param_grid_rf = {"randomforestregressor__n_estimators" : [5, 10, 20, 50, 60],
                "randomforestregressor__max_depth" : [5, 10, 15, 17]}
param_ridge = {}
param_linear = {}
    
pipe_rf = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=2))
pipe_tree = make_pipeline(StandardScaler(), DecisionTreeRegressor())
pipe_ridge = make_pipeline(StandardScaler(), Ridge())
pipe_linear = make_pipeline(StandardScaler(), LinearRegression())

regressors = {
    "linear": [pipe_linear, param_linear],
    "ridge": [pipe_ridge, param_ridge],
    "random_forest": [pipe_rf, param_grid_rf],
    "DTree": [pipe_tree, param_grid_dtree]
}
    
for name, model in regressors.items():  
    grid_search = GridSearchCV(model[0], model[1], cv=10, n_jobs=-1, return_train_score=True, refit = 'neg_mean_squared_error', scoring = ['neg_mean_squared_error'])
    grid_search.fit(X, y)
    print(f"Best score from {name} grid search: %.3f" % grid_search.best_score_)
    print(f"Best set of params from {name}", grid_search.best_params_)
    metrics = pd.DataFrame(grid_search.cv_results_)
    display(metrics[['mean_fit_time', 'mean_score_time', 'params', 'mean_train_neg_mean_squared_error', 'mean_test_neg_mean_squared_error']])
    


Best score from linear grid search: -502.054
Best set of params from linear {}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.313906 | 0.007894 | {} | -331.809058 | -502.053516 |


Best score from ridge grid search: -505.161
Best set of params from ridge {}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.023643 | 0.008769 | {} | -333.084503 | -505.160743 |


Best score from random_forest grid search: -403.435
Best set of params from random_forest {'randomforestregressor__max_depth': 17, 'randomforestregressor__n_estimators': 60}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.033503 | 0.008823 | {'randomforestregressor__max_depth': 5, 'rando... | -167.931996 | -410.463159 |
| 1 | 0.05331 | 0.010768 | {'randomforestregressor__max_depth': 5, 'rando... | -158.425217 | -403.865969 |
| 2 | 0.089862 | 0.012236 | {'randomforestregressor__max_depth': 5, 'rando... | -156.324299 | -403.473723 |
| 3 | 0.326072 | 0.038303 | {'randomforestregressor__max_depth': 5, 'rando... | -154.352223 | -406.888202 |
| 4 | 0.47039 | 0.037951 | {'randomforestregressor__max_depth': 5, 'rando... | -154.310716 | -407.31891 |
| 5 | 0.070263 | 0.018186 | {'randomforestregressor__max_depth': 10, 'rand... | -70.490213 | -445.862678 |
| 6 | 0.060395 | 0.010382 | {'randomforestregressor__max_depth': 10, 'rand... | -54.538793 | -433.745514 |
| 7 | 0.10136 | 0.013251 | {'randomforestregressor__max_depth': 10, 'rand... | -47.954009 | -415.469686 |
| 8 | 0.22498 | 0.016181 | {'randomforestregressor__max_depth': 10, 'rand... | -45.163632 | -411.680952 |
| 9 | 0.27947 | 0.01833 | {'randomforestregressor__max_depth': 10, 'rand... | -45.046828 | -405.995168 |


Best score from DTree grid search: -436.987
Best set of params from DTree {'decisiontreeregressor__max_depth': 5}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.034179 | 0.00954 | {'decisiontreeregressor__max_depth': 5} | -180.688471 | -436.986801 |
| 1 | 0.034521 | 0.009666 | {'decisiontreeregressor__max_depth': 10} | -27.971977 | -574.146281 |
| 2 | 0.017721 | 0.008404 | {'decisiontreeregressor__max_depth': 15} | -0.923241 | -556.987986 |
| 3 | 0.01482 | 0.006397 | {'decisiontreeregressor__max_depth': 20} | -0.003897 | -554.650676 |


In [47]:
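for feature in y:
    print(feature)
    y_hat = final_data[[f"{feature}"]]
    final_model = Ridge()
    # Assumption: the scores come from cross-validation with a negative-MSE scorer,
    # which matches the fit/score times and negative test scores printed below
    n_scores = cross_validate(final_model, X, y_hat, cv=cv, scoring='neg_mean_squared_error', n_jobs=-1)
    print(f"{feature}", pd.DataFrame(n_scores).mean())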

5%
5% fit_time        0.010668
score_time      0.009812
test_score   -200.279282
dtype: float64
10%
10% fit_time        0.012997
score_time      0.007859
test_score   -320.430713
dtype: float64
20%
20% fit_time        0.013822
score_time      0.008164
test_score   -475.521200
dtype: float64
30%
30% fit_time        0.010616
score_time      0.006947
test_score   -197.994520
dtype: float64
40%
40% fit_time        0.010688
score_time      0.007064
test_score   -230.194772
dtype: float64
50%
50% fit_time        0.013494
score_time      0.006297
test_score   -281.209033
dtype: float64
60%
60% fit_time        0.009758
score_time      0.013709
test_score   -381.469446
dtype: float64
70%
70% fit_time        0.017086
score_time      0.011804
test_score   -497.643802
dtype: float64
80%
80% fit_time        0.010842
score_time      0.011276
test_score   -588.736174
dtype: float64


In [6]:
y, X = final_data[['5%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']], final_data[['Density', 'Butanes', 'Heptanes', 'Hexanes', 'Octanes', 'Nonanes', 'Decanes']]

## Hyperparameter selection with GridSearchCV

In [189]:
X, y = final_data[['5%', '10%', '20%', '30%', '40%', '50%', '60%', '70%', '80%']], final_data[['Density', 'Butanes', 'Heptanes', 'Hexanes', 'Octanes', 'Nonanes']]
    
param_grid_dtree = {"decisiontreeregressor__max_depth": [5, 10, 15, 20]}
param_grid_rf = {"randomforestregressor__n_estimators" : [5, 10, 20, 50, 60],
                "randomforestregressor__max_depth" : [5, 10, 15, 17]}
param_ridge = {}
param_linear = {}
    
pipe_rf = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=2))
pipe_tree = make_pipeline(StandardScaler(), DecisionTreeRegressor())
pipe_ridge = make_pipeline(StandardScaler(), Ridge())
pipe_linear = make_pipeline(StandardScaler(), LinearRegression())

regressors = {
    "linear": [pipe_linear, param_linear],
    "ridge": [pipe_ridge, param_ridge],
    "random_forest": [pipe_rf, param_grid_rf],
    "DTree": [pipe_tree, param_grid_dtree]
}
    
for name, model in regressors.items():  
    grid_search = GridSearchCV(model[0], model[1], cv=5, n_jobs=-1, return_train_score=True, refit = 'neg_mean_squared_error', scoring = ['neg_mean_squared_error'])
    grid_search.fit(X, y)
    print(f"Best score from {name} grid search: %.3f" % grid_search.best_score_)
    print(f"Best set of params from {name}", grid_search.best_params_)
    metrics = pd.DataFrame(grid_search.cv_results_)
    display(metrics[['mean_fit_time', 'mean_score_time', 'params', 'mean_train_neg_mean_squared_error', 'mean_test_neg_mean_squared_error']])

Best score from linear grid search: -31.214
Best set of params from linear {}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.020215 | 0.010031 | {} | -10.185002 | -31.213832 |


Best score from ridge grid search: -22.849
Best set of params from ridge {}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.017576 | 0.009999 | {} | -11.748727 | -22.848682 |


Best score from random_forest grid search: -8.692
Best set of params from random_forest {'randomforestregressor__max_depth': 10, 'randomforestregressor__n_estimators': 5}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.058755 | 0.012433 | {'randomforestregressor__max_depth': 5, 'rando... | -2.265859 | -9.096219 |
| 1 | 0.051954 | 0.010936 | {'randomforestregressor__max_depth': 5, 'rando... | -2.188134 | -11.370755 |
| 2 | 0.081341 | 0.010714 | {'randomforestregressor__max_depth': 5, 'rando... | -1.709878 | -9.400003 |
| 3 | 0.189599 | 0.015639 | {'randomforestregressor__max_depth': 5, 'rando... | -1.589523 | -8.912567 |
| 4 | 0.220016 | 0.018417 | {'randomforestregressor__max_depth': 5, 'rando... | -1.603046 | -9.089839 |
| 5 | 0.036595 | 0.008734 | {'randomforestregressor__max_depth': 10, 'rand... | -1.393418 | -8.69204 |
| 6 | 0.067434 | 0.009063 | {'randomforestregressor__max_depth': 10, 'rand... | -1.266459 | -11.087068 |
| 7 | 0.100756 | 0.011827 | {'randomforestregressor__max_depth': 10, 'rand... | -0.728373 | -9.310693 |
| 8 | 0.251905 | 0.01758 | {'randomforestregressor__max_depth': 10, 'rand... | -0.592444 | -8.802918 |
| 9 | 0.387505 | 0.030392 | {'randomforestregressor__max_depth': 10, 'rand... | -0.603435 | -8.903126 |


Best score from DTree grid search: -21.100
Best set of params from DTree {'decisiontreeregressor__max_depth': 15}


|   | mean_fit_time | mean_score_time | params | mean_train_neg_mean_squared_error | mean_test_neg_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 0.053451 | 0.015281 | {'decisiontreeregressor__max_depth': 5} | -1.928085 | -21.349194 |
| 1 | 0.031775 | 0.017848 | {'decisiontreeregressor__max_depth': 10} | -0.249789 | -31.951354 |
| 2 | 0.022041 | 0.010947 | {'decisiontreeregressor__max_depth': 15} | -0.007885 | -21.100225 |
| 3 | 0.015963 | 0.006903 | {'decisiontreeregressor__max_depth': 20} | 0.0 | -28.876543 |
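
The scores in these searches are negative mean squared errors, so values closer to zero are better. To read them on the scale of the targets, they can be turned into a root mean squared error, as in this small sketch (the helper name is hypothetical; the two scores are the best random forest results printed in this notebook).

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a scikit-learn 'neg_mean_squared_error' score into an RMSE."""
    return math.sqrt(-neg_mse)

# Best random forest scores reported in the two searches.
for label, score in [("profile -> properties", -8.692),
                     ("properties -> profile", -403.435)]:
    print(f"{label}: RMSE ~ {rmse_from_neg_mse(score):.2f}")
```

This gives roughly 2.95 for the first search and roughly 20.1 for the second. The second number is in degrees C. The first one averages the squared error over density and the mass% columns, which have different units, so it should be read only as a relative measure between models.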
