This notebook uses TPOT, an AutoML library, to find the best predictive model and its hyperparameters for this dataset.

Part A: import the cleaned data and select X and y

Part B: Investigate the best parameters of tpot

Part C: Using the best parameters of tpot to find the best model & hyperparameters and then export

# Part A: import the cleaned data and select X and y

In [2]:
# Load ML packages
import pandas as pd
from sklearn.model_selection import RepeatedKFold, train_test_split

# AutoML with TPOT
import tpot



In [3]:
df = pd.read_csv('raw data/Chicago_weather_cleaned_V1.csv')
df

,Unnamed: 0,dt_iso,temp,dew_point,feels_like,pressure,humidity,wind_speed,wind_deg,clouds_all,date,year,month,day_of_week,hour_of_day,visibility_clean
0,0,2013-01-01 00:00:00,-2.87,-7.38,-7.90,1018,68,4.12,300,100,2013-01-01,2013,1,1,0,10000.00
1,1,2013-01-01 01:00:00,-3.12,-7.45,-7.35,1019,69,3.10,310,100,2013-01-01,2013,1,1,1,10000.00
2,2,2013-01-01 02:00:00,-3.12,-7.45,-6.83,1019,69,2.60,290,100,2013-01-01,2013,1,1,2,10000.00
3,3,2013-01-01 03:00:00,-2.87,-7.72,-7.90,1019,66,4.12,360,100,2013-01-01,2013,1,1,3,10000.00
4,4,2013-01-01 04:00:00,-4.17,-9.32,-10.57,1020,64,5.70,330,100,2013-01-01,2013,1,1,4,10000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86179,91354,2022-10-31 19:00:00,13.89,12.11,13.66,1011,89,1.34,27,100,2022-10-31,2022,10,0,19,7543.75
86180,91355,2022-10-31 20:00:00,14.15,12.19,13.92,1010,88,2.57,350,100,2022-10-31,2022,10,0,20,9656.00
86181,91356,2022-10-31 21:00:00,14.11,11.81,13.82,1011,86,3.13,351,100,2022-10-31,2022,10,0,21,9656.00
86182,91357,2022-10-31 22:00:00,13.62,11.84,13.36,1012,89,1.34,343,100,2022-10-31,2022,10,0,22,9656.00


In [4]:
df.columns

Index(['Unnamed: 0', 'dt_iso', 'temp', 'dew_point', 'feels_like', 'pressure',
       'humidity', 'wind_speed', 'wind_deg', 'clouds_all', 'date', 'year',
       'month', 'day_of_week', 'hour_of_day', 'visibility_clean'],
      dtype='object')

In [5]:
df=df.drop(columns=['dt_iso', 'Unnamed: 0', 'date', 'year', 'month', 'day_of_week', 'hour_of_day'])
df

,temp,dew_point,feels_like,pressure,humidity,wind_speed,wind_deg,clouds_all,visibility_clean
0,-2.87,-7.38,-7.90,1018,68,4.12,300,100,10000.00
1,-3.12,-7.45,-7.35,1019,69,3.10,310,100,10000.00
2,-3.12,-7.45,-6.83,1019,69,2.60,290,100,10000.00
3,-2.87,-7.72,-7.90,1019,66,4.12,360,100,10000.00
4,-4.17,-9.32,-10.57,1020,64,5.70,330,100,10000.00
...,...,...,...,...,...,...,...,...,...
86179,13.89,12.11,13.66,1011,89,1.34,27,100,7543.75
86180,14.15,12.19,13.92,1010,88,2.57,350,100,9656.00
86181,14.11,11.81,13.82,1011,86,3.13,351,100,9656.00
86182,13.62,11.84,13.36,1012,89,1.34,343,100,9656.00


In [6]:
# Define X and y
X = df.drop(columns=['temp'])
y = df['temp']

In [7]:
# Split in train and test
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
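
As a quick sanity check on the split, the expected partition sizes can be worked out by hand. This is a minimal sketch: it assumes the frame has 86183 rows (inferred from the last index label, 86182, in the output above) and that scikit-learn rounds the test share up to a whole row. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remainder goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 86183 rows is inferred from the displayed index, not read from the file.
n_train, n_test = split_sizes(86183, 0.3)
print(n_train, n_test)
```

If the assumptions hold, these two numbers should match `len(x_train)` and `len(x_test)`.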

# Part B: Investigate the best parameters of tpot

TPOT settings tested, one at a time:
1. cv: RepeatedKFold versus the TPOT default
2. n_jobs: -1 (all cores) versus 1 and 4
3. generations: 5, 10 and 15
4. population_size: 50, 100 and 150
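
Before running these experiments it helps to estimate how much work each one asks for. The sketch below relies on the rule from the TPOT documentation that a run evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`, and on the assumption that every pipeline is fitted once per cross-validation fold (5 folds by default, 30 for `RepeatedKFold(n_splits=10, n_repeats=3)`). `tpot_budget` is a hypothetical helper. The result is a rough upper bound, since TPOT can skip duplicate or timed-out pipelines.

```python
def tpot_budget(generations, population_size, cv_folds=5, offspring_size=None):
    """Estimate (pipelines evaluated, model fits) for one TPOT run."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size.
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    return pipelines, pipelines * cv_folds


# Settings used in this notebook; cv_folds=30 stands for 10 splits x 3 repeats.
runs = {
    "RepeatedKFold 10x3": dict(generations=5, population_size=50, cv_folds=30),
    "default cv": dict(generations=5, population_size=50),
    "generations=10": dict(generations=10, population_size=50),
    "generations=15": dict(generations=15, population_size=50),
    "population_size=100": dict(generations=5, population_size=100),
    "population_size=150": dict(generations=5, population_size=150),
    "final run (Part C)": dict(generations=100, population_size=50),
}

for name, kwargs in runs.items():
    pipelines, fits = tpot_budget(**kwargs)
    print(f"{name:<22} pipelines={pipelines:>5}  fits={fits:>6}")
```

Under these assumptions the RepeatedKFold run costs about six times as many fits as the default-cv run with the same search settings, which is worth weighing against any gain in score stability.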

In [27]:
# 1. cv = RepeatedKFold (10 splits x 3 repeats)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Initialise TPOT and run the search
model = tpot.TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error',
        cv=cv, verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                               
Generation 1 - Current best internal CV score: -0.07926301035162608
                                                                              
Generation 2 - Current best internal CV score: -0.07926301035162608
                                                                              
Generation 3 - Current best internal CV score: -0.06827032956335866
                                                                                
Generation 4 - Current best internal CV score: -0.06827032956335866
                                                                                
Generation 5 - Current best internal CV score: -0.06783480867038125
                                                             
Best pipeline: ExtraTreesRegressor(MaxAbsScaler(input_matrix), bootstrap=True, max_features=0.7500000000000001, min_samples_leaf=6, min_samples_split=11, n_estimators=100)


In [30]:
# 1. cv = default, n_jobs=-1, generations=5, population_size=50
# (no cv argument is passed, so TPOT uses its default cross-validation)
model = tpot.TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                              
Generation 1 - Current best internal CV score: -0.04435695651482135
                                                                              
Generation 2 - Current best internal CV score: -0.03297609467166549
                                                                              
Generation 3 - Current best internal CV score: -0.03297609467166549
                                                                              
Generation 4 - Current best internal CV score: -0.023131807684041196
                                                                              
Generation 5 - Current best internal CV score: -0.0228571792000878
                                                           
Best pipeline: ExtraTreesRegressor(GradientBoostingRegressor(CombineDFs(input_matrix, input_matrix), alpha=0.75, learning_rate=0.001, loss=huber, max_depth=3, max_features=0.8500000000000001, min_samples_l

In [31]:
# 2. n_jobs=-1 (same settings as the previous cell)
model = tpot.TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                              
Generation 1 - Current best internal CV score: -0.03775511948490618
                                                                              
Generation 2 - Current best internal CV score: -0.03340676415880757
                                                                              
Generation 3 - Current best internal CV score: -0.03340676415880757
                                                                              
Generation 4 - Current best internal CV score: -0.031692248610065206
                                                                              
Generation 5 - Current best internal CV score: -0.031692248610065206
                                                           
Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.9000000000000001, min_samples_leaf=7, min_samples_split=15, n_estimators=100)


In [32]:
# 2. n_jobs=1
model = tpot.TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=1)
model.fit(x_train,y_train)

                                                                                
Generation 1 - Current best internal CV score: -0.028141395297989404
                                                                                   
Generation 2 - Current best internal CV score: -0.028141395297989404
                                                                                   
Generation 3 - Current best internal CV score: -0.028141395297989404
                                                                                   
Generation 4 - Current best internal CV score: -0.026268627088602337
                                                                                
Generation 5 - Current best internal CV score: -0.02287597727351713
                                                             
Best pipeline: ExtraTreesRegressor(StandardScaler(CombineDFs(ZeroCount(input_matrix), input_matrix)), bootstrap=False, max_features=1.0, min_samples_leaf=2, min_samples_split=18, n

In [33]:
# 2. n_jobs=4
model = tpot.TPOTRegressor(generations=5, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=4)
model.fit(x_train,y_train)

                                                                              
Generation 1 - Current best internal CV score: -0.028680745407945844
                                                                              
Generation 2 - Current best internal CV score: -0.028634971634393526
                                                                              
Generation 3 - Current best internal CV score: -0.028270243965987422
                                                                              
Generation 4 - Current best internal CV score: -0.026640467489832093
                                                                                
Generation 5 - Current best internal CV score: -0.024341824772495572
                                                             
Best pipeline: ExtraTreesRegressor(AdaBoostRegressor(MinMaxScaler(input_matrix), learning_rate=0.01, loss=square, n_estimators=100), bootstrap=False, max_features=0.8500000000000001, min_samples_l

In [34]:
# 3. generations=10
model = tpot.TPOTRegressor(generations=10, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                               
Generation 1 - Current best internal CV score: -0.040358810241249765
                                                                              
Generation 2 - Current best internal CV score: -0.040358810241249765
                                                                                
Generation 3 - Current best internal CV score: -0.040358810241249765
                                                                                
Generation 4 - Current best internal CV score: -0.03411373542950239
                                                                                
Generation 5 - Current best internal CV score: -0.02660872316630091
                                                                                
Generation 6 - Current best internal CV score: -0.02634566752596343
                                                                                
Generation 7 - Current be

In [35]:
# 3. generations=15
model = tpot.TPOTRegressor(generations=15, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                               
Generation 1 - Current best internal CV score: -0.04603080397642954
                                                                                
Generation 2 - Current best internal CV score: -0.04470645959516754
                                                                                
Generation 3 - Current best internal CV score: -0.04470645959516754
                                                                                
Generation 4 - Current best internal CV score: -0.028928976918645564
                                                                                
Generation 5 - Current best internal CV score: -0.028928976918645564
                                                                                
Generation 6 - Current best internal CV score: -0.028928976918645564
                                                                                  
Generation 7 - Curren

In [36]:
# 4. population_size=100
model = tpot.TPOTRegressor(generations=5, population_size=100, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                                
Generation 1 - Current best internal CV score: -0.028682786567926704
                                                                                
Generation 2 - Current best internal CV score: -0.026077543429349155
                                                                                
Generation 3 - Current best internal CV score: -0.026077543429349155
                                                                                
Generation 4 - Current best internal CV score: -0.025127623424690604
                                                                                
Generation 5 - Current best internal CV score: -0.02414594422299557
                                                             
Best pipeline: ExtraTreesRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=False, max_features=0.7500000000000001, min_samples_leaf

In [37]:
# 4. population_size=150
model = tpot.TPOTRegressor(generations=5, population_size=150, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)

                                                                                
Generation 1 - Current best internal CV score: -0.026373980288600198
                                                                                  
Generation 2 - Current best internal CV score: -0.026373980288600198
                                                                                  
Generation 3 - Current best internal CV score: -0.026373980288600198
                                                                                  
Generation 4 - Current best internal CV score: -0.021452746844401748
                                                                                
Generation 5 - Current best internal CV score: -0.021452746844401748
                                                             
Best pipeline: ExtraTreesRegressor(ElasticNetCV(CombineDFs(input_matrix, PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False)), l1_ratio=0.700000000
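
To compare the Part B runs side by side, the generation 5 scores printed above can be collected and ranked. The sketch below copies those values by hand. The `generations=10` and `generations=15` runs are left out because their logs are cut off, and the labels are only descriptive names chosen here. Scores are negated MAE, so the value closest to zero is best.

```python
# Generation 5 "best internal CV score" values copied from the logs above.
final_cv_scores = {
    "cv=RepeatedKFold(10, 3)": -0.06783480867038125,
    "default cv, n_jobs=-1 (first run)": -0.0228571792000878,
    "default cv, n_jobs=-1 (second run)": -0.031692248610065206,
    "n_jobs=1": -0.02287597727351713,
    "n_jobs=4": -0.024341824772495572,
    "population_size=100": -0.02414594422299557,
    "population_size=150": -0.021452746844401748,
}

# Rank by MAE (the negated score), smallest error first.
ranked = sorted(final_cv_scores.items(), key=lambda item: -item[1])
for name, score in ranked:
    print(f"{name:<36} MAE={-score:.4f}")

best_name = ranked[0][0]

# Two runs with identical settings give a feel for run-to-run noise.
repeat_spread = abs(
    final_cv_scores["default cv, n_jobs=-1 (first run)"]
    - final_cv_scores["default cv, n_jobs=-1 (second run)"]
)
print(f"best: {best_name}, spread between identical runs: {repeat_spread:.4f}")
```

The two identical default-cv runs differ by roughly 0.009 MAE, which is larger than most of the gaps between different settings, so these single runs are weak evidence for preferring one setting over another. Passing a fixed `random_state` to `TPOTRegressor` and repeating each configuration would make the comparison more reliable.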

# Part C: Using the best parameters of tpot to find the best model & hyperparameters and then export

In [8]:
# Final search with the selected TPOT settings
model = tpot.TPOTRegressor(generations=100, population_size=50, scoring='neg_mean_absolute_error',
        verbosity=2, n_jobs=-1)
model.fit(x_train,y_train)


                                                                                 
Generation 1 - Current best internal CV score: -0.03793044831586472
                                                                                  
Generation 2 - Current best internal CV score: -0.021636087692851653
                                                                                  
Generation 3 - Current best internal CV score: -0.021591179452368253
                                                                                  
Generation 4 - Current best internal CV score: -0.021591179452368253
                                                                                  
Generation 5 - Current best internal CV score: -0.021591179452368253
                                                                                  
Generation 6 - Current best internal CV score: -0.021468385787685614
                                                                                    
Gener

In [None]:
# Score on the held-out test set (negated MAE, because scoring='neg_mean_absolute_error')
model.score(x_test,y_test)



-0.063059784755875
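
The negative sign comes from the scoring convention rather than from a poor fit: with `scoring='neg_mean_absolute_error'`, the reported score is the mean absolute error multiplied by -1 so that larger is always better. The sketch below illustrates this with a hand-written stand-in (`mean_abs_error`, a hypothetical helper that computes the same quantity as `sklearn.metrics.mean_absolute_error`) on made-up toy values, then converts the test score above back to an MAE.

```python
def mean_abs_error(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only; they are not taken from the dataset.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

mae = mean_abs_error(y_true, y_pred)
neg_mae = -mae  # the form a 'neg_mean_absolute_error' scorer reports
print(mae, neg_mae)

# Converting the test score printed above back to an error in the units of temp.
test_score = -0.063059784755875
test_mae = -test_score
print(f"test MAE = {test_mae:.4f}")
```

The test MAE of about 0.063 is higher than the internal CV scores visible in the Part C log (about 0.021 by generation 6), although that log is cut off, so the final CV score is not known.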

In [None]:
# Export the best pipeline as a standalone Python script
model.export('tpot_ml_pipeline.py')