<div>
<img src="https://www.ul.ie/themes/custom/ul/logo.jpg" />
</div>

# **MSc in Artificial Intelligence and Machine Learning**
## CS6271 - Evolutionary Algorithms and Humanoid Robotics 2023
### Kaggle Competition


Module Leader: Conor Ryan

Developer: Allan De Lima

Link to access the competition: https://www.kaggle.com/competitions/cs6271-20234-final-project

Link to join the competition: https://www.kaggle.com/t/2b316ba38c144f23ac780c8fc898b4d7



## Introduction

Predict whether income exceeds $50K/yr based on census data. This is a shorter version of the "Adult" dataset, also known as the "Census Income" dataset (donated on 4/30/1996).

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

Class:

income: >50K, <=50K.


Listing of features:

age: continuous.

workclass: categorical (Private, Self-emp-not-inc, Local-gov, State-gov).

education: categorical (Bachelors, Some-college, HS-grad, Masters, Doctorate).

marital-status: categorical (Married-civ-spouse, Divorced, Never-married).

relationship: categorical (Wife, Husband, Not-in-family, Other-relative).

race: categorical (White, Asian-Pac-Islander, Black).

sex: categorical (Female, Male).

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: categorical (United-States, Others).


### Load the dataset

In [None]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [None]:
## mount your Google drive
# 1) run this cell
# 2) sign in
# 3) verify your drive is mounted

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


First, clone the GRAPE repository, because the dataset to be used is already there.

In [None]:
import os
# Get the library from our BDS research Group
# copy the path from your drive
PATH = '/content/drive/MyDrive/grape/'

# check if 'grape' already exists
if os.path.exists(PATH):
    print('grape directory already exists')
else:
    %cd /content/drive/MyDrive/
    !git clone https://github.com/bdsul/grape.git
    print('Cloning grape in your Drive')

# change directory to 'grape'
%cd /content/drive/MyDrive/grape/

grape directory already exists
/content/drive/MyDrive/grape


Now you have a grape folder in your Drive account.

Upload the files adult_training.csv and adult_test.csv to the folder grape/datasets in your Drive before running the next cells.

### Train set

In [None]:
train_file = 'datasets/adult_training.csv'

In [None]:
# load train set
df_train = pd.read_csv(PATH+train_file)
df_train.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,28,Private,Bachelors,Never-married,Not-in-family,White,Male,0,0,40,United-States,<=50K
1,34,Self-emp-not-inc,Bachelors,Married-civ-spouse,Husband,Black,Male,0,1887,48,United-States,>50K
2,32,Private,Bachelors,Never-married,Not-in-family,Black,Female,0,0,40,United-States,<=50K
3,46,Private,Bachelors,Divorced,Not-in-family,White,Male,0,0,40,Others,<=50K
4,44,Private,Bachelors,Married-civ-spouse,Husband,White,Male,0,0,50,United-States,>50K


In [None]:
df_train.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,5200.0,5200.0,5200.0,5200.0
mean,39.688077,1059.895,109.486346,42.786538
std,11.973363,6687.36408,442.694051,10.937644
min,17.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,2559.0,99.0


In [None]:
X_train = df_train.copy()
# warning: cannot drop it more than once
X_train.drop(['income'], axis=1, inplace=True)

In [None]:
# Standardise the continuous features (zero mean, unit standard deviation)
num_cols = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
X_train[num_cols] = (X_train[num_cols] - X_train[num_cols].mean()) / X_train[num_cols].std()
# One-hot encode the categorical (non-binary) features
X_train2 = pd.get_dummies(X_train, columns=['workclass', 'education', 'marital-status', 'relationship', 'race'])

print(X_train2)


           age     sex  capital-gain  capital-loss  hours-per-week  \
0    -0.976173    Male     -0.158492     -0.247318       -0.254766   
1    -0.475061    Male     -0.158492      4.015219        0.476653   
2    -0.642098  Female     -0.158492     -0.247318       -0.254766   
3     0.527164    Male     -0.158492     -0.247318       -0.254766   
4     0.360126    Male     -0.158492     -0.247318        0.659508   
...        ...     ...           ...           ...             ...   
5195  1.445870  Female     -0.158492     -0.247318       -0.254766   
5196 -1.393767    Male     -0.158492     -0.247318       -0.254766   
5197  0.610682    Male     -0.158492     -0.247318        2.030918   
5198  2.197538    Male      2.839849     -0.247318       -0.254766   
5199  0.109570    Male     -0.158492     -0.247318       -0.254766   

     native-country  workclass_Local-gov  workclass_Private  \
0     United-States                    0                  1   
1     United-States              

Represent the outputs as 0 where the income is less than or equal to 50K and as 1 where it is greater than 50K.

Follow this encoding exactly, because the test targets are represented this way in the competition.

In [None]:


gender_mapping = {'Male': 1, 'Female': 0}
country_mapping = {"United-States" : 1, "Others" : 0}

# Use the 'replace' method to map the binary 'sex' and 'native-country' columns to 0/1
X_train2['sex'] = X_train2['sex'].replace(gender_mapping)
X_train2['native-country'] = X_train2['native-country'].replace(country_mapping)
print(X_train2)

           age  sex  capital-gain  capital-loss  hours-per-week  \
0    -0.976173    1     -0.158492     -0.247318       -0.254766   
1    -0.475061    1     -0.158492      4.015219        0.476653   
2    -0.642098    0     -0.158492     -0.247318       -0.254766   
3     0.527164    1     -0.158492     -0.247318       -0.254766   
4     0.360126    1     -0.158492     -0.247318        0.659508   
...        ...  ...           ...           ...             ...   
5195  1.445870    0     -0.158492     -0.247318       -0.254766   
5196 -1.393767    1     -0.158492     -0.247318       -0.254766   
5197  0.610682    1     -0.158492     -0.247318        2.030918   
5198  2.197538    1      2.839849     -0.247318       -0.254766   
5199  0.109570    1     -0.158492     -0.247318       -0.254766   

      native-country  workclass_Local-gov  workclass_Private  \
0                  1                    0                  1   
1                  1                    0                  0   
2  

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 10000)
pd.set_option('display.colheader_justify', 'center')

X_train2.head()

Unnamed: 0,age,sex,capital-gain,capital-loss,hours-per-week,native-country,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Some-college,marital-status_Divorced,marital-status_Married-civ-spouse,marital-status_Never-married,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White
0,-0.976173,1,-0.158492,-0.247318,-0.254766,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1
1,-0.475061,1,-0.158492,4.015219,0.476653,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0
2,-0.642098,0,-0.158492,-0.247318,-0.254766,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0
3,0.527164,1,-0.158492,-0.247318,-0.254766,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1
4,0.360126,1,-0.158492,-0.247318,0.659508,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1


In [None]:
# class labels
l, _ = X_train.shape

y_train = np.zeros([l,], dtype=int)

for i in range(l):
  if df_train['income'].iloc[i] == '>50K':
    y_train[i] = 1
  elif df_train['income'].iloc[i] == '<=50K':
    y_train[i] = 0

In [None]:
print(y_train[0:5]) #print head

[0 1 0 0 1]
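The same labels can also be produced without an explicit loop. The following is a minimal sketch on a made-up `income` series (the five values mirror the head of the training set shown above); it assumes every entry is either `'>50K'` or `'<=50K'`.

```python
import pandas as pd

# Made-up stand-in for df_train['income']
income = pd.Series(['<=50K', '>50K', '<=50K', '<=50K', '>50K'])

# True where income is '>50K', then cast to 0/1 and convert to a NumPy array
y = (income == '>50K').astype(int).to_numpy()
print(y)
```

A vectorised comparison like this avoids indexing the DataFrame row by row, which matters more as the number of samples grows.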


### Test set

In [None]:
test_file = 'datasets/adult_test.csv'

In [None]:
# load test set
df_test = pd.read_csv(PATH+test_file)
df_test.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,33,Private,HS-grad,Never-married,Not-in-family,White,Male,3325,0,50,United-States
1,58,Private,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,40,United-States
2,30,Self-emp-not-inc,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,60,United-States
3,26,Private,Some-college,Never-married,Not-in-family,White,Female,0,0,20,United-States
4,43,State-gov,HS-grad,Never-married,Not-in-family,White,Male,0,0,60,United-States


In [None]:
df_test.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,10402.0,10402.0,10402.0,10402.0
mean,39.811575,1280.969237,106.101038,42.749567
std,12.063746,7826.438595,438.826968,11.200949
min,18.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,3683.0,99.0


In [None]:
X_test = df_test.copy()

In [None]:
# Standardise the continuous features (zero mean, unit standard deviation)
num_cols = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
X_test[num_cols] = (X_test[num_cols] - X_test[num_cols].mean()) / X_test[num_cols].std()
# One-hot encode the categorical (non-binary) features
X_test2 = pd.get_dummies(X_test, columns=['workclass', 'education', 'marital-status', 'relationship', 'race'])


gender_mapping = {'Male': 1, 'Female': 0}
country_mapping = {"United-States" : 1, "Others" : 0}

# Use the 'replace' method to map the binary 'sex' and 'native-country' columns to 0/1
X_test2['sex'] = X_test2['sex'].replace(gender_mapping)
X_test2['native-country'] = X_test2['native-country'].replace(country_mapping)
print(X_test2.head())


      age    sex  capital-gain  capital-loss  hours-per-week  native-country  workclass_Local-gov  workclass_Private  workclass_Self-emp-not-inc  workclass_State-gov  education_Bachelors  education_Doctorate  education_HS-grad  education_Masters  education_Some-college  marital-status_Divorced  marital-status_Married-civ-spouse  marital-status_Never-married  relationship_Husband  relationship_Not-in-family  relationship_Other-relative  relationship_Wife  race_Asian-Pac-Islander  race_Black  race_White
0 -0.564632   1     0.261170     -0.241783       0.647305            1                 0                   1                       0                       0                    0                    0                   1                  0                     0                       0                             0                                1                         0                        1                           0                       0                     0                  0   

You will need to prepare both training and test datasets before working with a Machine Learning method.

Bear in mind that categorical data must be encoded in some way.

You are free to use any other pre-processing ideas.
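One possible pre-processing idea, sketched here on toy data rather than on the competition files: compute the scaling statistics on the training set only and reuse them for the test set, and align the one-hot columns of the test set with those of the training set. The frames, the `prepare` helper and the column choices below are all illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frames standing in for df_train / df_test (values are made up)
train = pd.DataFrame({"age": [20, 30, 40], "race": ["White", "Black", "White"]})
test = pd.DataFrame({"age": [30, 50], "race": ["White", "White"]})

num_cols = ["age"]
cat_cols = ["race"]

# Scaling statistics come from the training set only
mu = train[num_cols].mean()
sigma = train[num_cols].std()


def prepare(df, columns=None):
    """Standardise numeric columns and one-hot encode categorical ones."""
    out = df.copy()
    out[num_cols] = (out[num_cols] - mu) / sigma
    out = pd.get_dummies(out, columns=cat_cols, dtype=int)
    if columns is not None:
        # Add dummy columns missing from this split and fix the column order
        out = out.reindex(columns=columns, fill_value=0)
    return out


train_prep = prepare(train)
test_prep = prepare(test, columns=train_prep.columns)
print(test_prep)
```

Aligning the columns matters for a grammar that refers to features by position (`x[0]`, `x[1]`, ...): if a category is absent from one split, the indices would otherwise shift.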

In [None]:
#Include your code here

Convert the datasets to NumPy arrays so they are easier to use.

In [None]:
# data features
X_train = X_train2.to_numpy()
X_test = X_test2.to_numpy()


## GRAPE

<div>
<img src="https://drive.google.com/uc?export=view&id=1hw43Oi3lGTCkspQ0ged2bZB8q2EpcPhz" width="150"/>
</div>

GRammatical Algorithms in Python for Evolution (GRAPE)


In [None]:
!pip install deap

import grape
import algorithms

from os import path
from deap import creator, base, tools
import random
import csv



You can import functions to be used with your grammar from [functions.py](https://github.com/UL-BDS/grape/blob/main/functions.py) in the GRAPE repository, and/or you can define your own functions.

In [None]:
from functions import add, sub, mul, pdiv, psqrt, plog, neg, and_, or_, not_, less_than_or_equal, greater_than_or_equal

'heartDisease.bnf' is a grammar written for another problem; use it only to check that everything is working.

Write your own grammar in a text file and save it in your Drive account.

Set GRAMMAR_FILE to the full path of that file and print its contents to check it.

In [None]:
%cd /content/drive/MyDrive/

/content/drive/MyDrive


In [None]:
pwd

'/content/drive/MyDrive'

In [None]:
GRAMMAR_FILE = 'adult.bnf'  # full path of your own grammar file
# GRAMMAR_FILE = 'heartDisease.bnf'  # sample grammar, only for checking the setup

# Print the grammar to check it
with open(GRAMMAR_FILE, "r") as f:
    print(f.read())


<log_op> ::= <conditional_branches> | and_(<log_op>,<log_op>) | or_(<log_op>,<log_op>) | not_(<log_op>) | <boolean_feature>
<conditional_branches> ::= less_than_or_equal(<num_op>,<num_op>) | greater_than_or_equal(<num_op>, <num_op>)
<num_op>   ::= add(<num_op>,<num_op>) | sub(<num_op>,<num_op>) | mul(<num_op>,<num_op>) | pdiv(<num_op>,<num_op>) | <nonboolean_feature>
<boolean_feature> ::= x[1]|x[5]|x[6]|x[7]|x[8]|x[9]|x[10]|x[11]|x[12]|x[13]|x[14]|x[15]|x[16]|x[17]|x[18]|x[19]|x[20]|x[21]|x[22]|x[23]|x[24]
<nonboolean_feature> ::= x[0]|x[2]|x[3]|x[4]|<c><c>.<c><c>
<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
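The feature indices in the grammar above follow the column order of the prepared dataset. As a minimal sketch, the two feature rules could be generated from the column names instead of being typed by hand; the shortened `columns` list and the `feature_rules` helper are illustrative assumptions.

```python
# Shortened column list for illustration; the real one has 25 entries
columns = ['age', 'sex', 'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country', 'workclass_Local-gov']
numeric = {'age', 'capital-gain', 'capital-loss', 'hours-per-week'}


def feature_rules(columns, numeric):
    """Build the <boolean_feature> and <nonboolean_feature> rules."""
    bool_terms = [f"x[{i}]" for i, c in enumerate(columns) if c not in numeric]
    num_terms = [f"x[{i}]" for i, c in enumerate(columns) if c in numeric]
    bool_rule = "<boolean_feature> ::= " + "|".join(bool_terms)
    # Numeric features can also be replaced by a constant of the form dd.dd
    num_rule = "<nonboolean_feature> ::= " + "|".join(num_terms) + "|<c><c>.<c><c>"
    return bool_rule, num_rule


for rule in feature_rules(columns, numeric):
    print(rule)
```

With the full column list (for example `list(X_train2.columns)`), this keeps the grammar in step with any change to the pre-processing.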


Run the following cell to load your grammar into the `Grammar` class.

In [None]:
BNF_GRAMMAR = grape.Grammar(GRAMMAR_FILE)

The fitness function here is the fraction of outputs wrongly predicted (the classification error, between 0 and 1), so lower is better.

You can write your own fitness function if you prefer.

In [None]:
def fitness_eval(individual, points):
    """
    Fitness function: fraction of samples wrongly classified.
    """

    x = points[0]  # features, shape (n_features, n_samples); used by the phenotype
    Y = points[1]  # expected classes (0 or 1)

    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # Threshold the raw output: positive (or True) means class 1
    try:
        Y_class = [1 if pred[i] > 0 else 0 for i in range(len(Y))]
    except (IndexError, TypeError):
        return np.nan,

    compare = np.equal(Y,Y_class)
    fitness = 1 - np.mean(compare)

    return fitness,

To use the fitness function above properly with GRAPE, the features must be in the rows and the samples in the columns, so if your data is not arranged like that, you need to transpose the matrix.

Check the printed shapes. If you run this cell twice, the matrices will be transposed back and the fitness function will not work properly.
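To see why the orientation matters, here is a minimal sketch on made-up data. With features in the rows, `x[i]` is the vector of feature `i` over all samples, so a phenotype is evaluated for every sample at once. The NumPy functions below are stand-ins chosen for illustration; they are assumed to behave element-wise like the corresponding functions in GRAPE's `functions.py`.

```python
import numpy as np

# 3 features (rows) x 4 samples (columns); values are made up
X = np.array([[0.5, -1.0, 2.0, 0.0],
              [1.0,  0.0, 1.0, 0.0],
              [0.0,  0.0, 3.0, 1.0]])
y = np.array([1, 0, 1, 0])

x = X  # the phenotype refers to the data as x
# Stand-in for a phenotype such as and_(x[1], less_than_or_equal(x[2], x[0]))
pred = np.logical_and(x[1], np.less_equal(x[2], x[0]))

# Same thresholding and error measure as in fitness_eval
y_class = (pred > 0).astype(int)
error = 1 - np.mean(y_class == y)
print(y_class, error)
```

If `X` were left with samples in the rows, `x[1]` would instead be the second sample, and the prediction would no longer have one entry per sample.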

In [None]:
X_train = np.transpose(X_train)
X_test = np.transpose(X_test)

print('Training (X,Y):\t', X_train.shape, y_train.shape)
print('Test (X):\t', X_test.shape)

Training (X,Y):	 (25, 5200) (5200,)
Test (X):	 (25, 10402)





Set the Grammatical Evolution parameters.

Make sure you set a random seed just in case we need to re-run your experiments.

In [None]:
POPULATION_SIZE = 1000
MAX_GENERATIONS = 100
P_CROSSOVER = 0.7
P_MUTATION = 0.1
ELITE_SIZE = 1

TOURNAMENT_SIZE = 7
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

CODON_CONSUMPTION = 'lazy'
GENOME_REPRESENTATION = 'list'
MAX_GENOME_LENGTH = None
HALL_OF_FAME_SIZE = 1

MAX_INIT_TREE_DEPTH = 10
MIN_INIT_TREE_DEPTH = 5
MAX_TREE_DEPTH = 17
MAX_WRAPS = 0
CODON_SIZE = 255

REPORT_ITEMS = ['gen', 'invalid', 'avg', 'std', 'min', 'max',
                'best_ind_length', 'avg_length',
                'best_ind_nodes', 'avg_nodes',
                'best_ind_depth', 'avg_depth',
                'avg_used_codons', 'best_ind_used_codons',
                'structural_diversity', 'fitness_diversity',
                'selection_time', 'generation_time']
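The effect of fixing the seed can be checked in isolation. This is a minimal sketch using a local `random.Random` instance; the `sample_codons` helper is hypothetical and only mimics drawing integer codons, it is not a GRAPE function.

```python
import random


def sample_codons(seed, n=5, codon_size=255):
    """Draw n integers in [0, codon_size] from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.randint(0, codon_size) for _ in range(n)]


first = sample_codons(42)
second = sample_codons(42)
print(first == second)  # same seed, same sequence
```

Note that `random.seed` only affects Python's `random` module. If any of your own code draws from `numpy.random`, that generator would need to be seeded separately.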

Create a toolbox.

In [None]:
toolbox = base.Toolbox()

# define a single objective, minimising fitness strategy:
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))

creator.create('Individual', grape.Individual, fitness=creator.FitnessMin)

toolbox.register("populationCreator", grape.sensible_initialisation, creator.Individual)

toolbox.register("evaluate", fitness_eval)

# Tournament selection:
toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)

# Single-point crossover:
toolbox.register("mate", grape.crossover_onepoint)

# Flip-int mutation:
toolbox.register("mutate", grape.mutation_int_flip_per_codon)
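For intuition about the selection operator registered above, tournament selection can be sketched in plain Python. This is an illustrative re-implementation for a minimisation problem, not DEAP's code; the `tournament` helper and the fitness values are made up.

```python
import random


def tournament(fitnesses, k, tournsize, rng):
    """Return the indices of k winners; the lowest fitness wins each round."""
    winners = []
    for _ in range(k):
        # Draw tournsize contestants at random, with replacement
        aspirants = [rng.randrange(len(fitnesses)) for _ in range(tournsize)]
        winners.append(min(aspirants, key=lambda i: fitnesses[i]))
    return winners


rng = random.Random(0)
errors = [0.30, 0.10, 0.50, 0.20]  # made-up classification errors
print(tournament(errors, k=6, tournsize=3, rng=rng))
```

A larger tournament size increases selection pressure, because the best individuals are more likely to appear in (and win) each tournament.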

In [None]:
# create initial population (generation 0):
population = toolbox.populationCreator(pop_size=POPULATION_SIZE,
                                           bnf_grammar=BNF_GRAMMAR,
                                           min_init_depth=MIN_INIT_TREE_DEPTH,
                                           max_init_depth=MAX_INIT_TREE_DEPTH,
                                           codon_size=CODON_SIZE,
                                           codon_consumption=CODON_CONSUMPTION,
                                           genome_representation=GENOME_REPRESENTATION
                                            )

# define the hall-of-fame object:
hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

# prepare the statistics object:
stats = tools.Statistics(key=lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("std", np.nanstd)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)

Run Grammatical Evolution.

In [None]:
population, logbook = algorithms.ge_eaSimpleWithElitism(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                                              ngen=MAX_GENERATIONS, elite_size=ELITE_SIZE,
                                              bnf_grammar=BNF_GRAMMAR,
                                              codon_size=CODON_SIZE,
                                              max_tree_depth=MAX_TREE_DEPTH,
                                              max_genome_length=MAX_GENOME_LENGTH,
                                              points_train=[X_train, y_train],
                                              codon_consumption=CODON_CONSUMPTION,
                                              report_items=REPORT_ITEMS,
                                              genome_representation=GENOME_REPRESENTATION,
                                              stats=stats, halloffame=hof, verbose=False)

gen = 0 , Best fitness = (0.2990384615384616,)
gen = 1 , Best fitness = (0.2990384615384616,) , Number of invalids = 203
gen = 2 , Best fitness = (0.2909615384615385,) , Number of invalids = 107
gen = 3 , Best fitness = (0.27057692307692305,) , Number of invalids = 95
gen = 4 , Best fitness = (0.26865384615384613,) , Number of invalids = 76
gen = 5 , Best fitness = (0.2548076923076923,) , Number of invalids = 70
gen = 6 , Best fitness = (0.24134615384615388,) , Number of invalids = 64
gen = 7 , Best fitness = (0.24134615384615388,) , Number of invalids = 64
gen = 8 , Best fitness = (0.24134615384615388,) , Number of invalids = 58
gen = 9 , Best fitness = (0.24134615384615388,) , Number of invalids = 43
gen = 10 , Best fitness = (0.24134615384615388,) , Number of invalids = 19
gen = 11 , Best fitness = (0.24134615384615388,) , Number of invalids = 58
gen = 12 , Best fitness = (0.24038461538461542,) , Number of invalids = 40
gen = 13 , Best fitness = (0.2382692307692308,) , Number of inv

Show the best individual as an expression.

In [None]:
# Best individual
import textwrap
best = hof.items[0].phenotype
print("Best individual: \n","\n".join(textwrap.wrap(best,80)))
print("\nTraining Fitness: ", hof.items[0].fitness.values[0])
print("Depth: ", hof.items[0].depth)
print("Length of the genome: ", len(hof.items[0].genome))
print(f'Used portion of the genome: {hof.items[0].used_codons/len(hof.items[0].genome):.2f}')

Best individual: 
 not_(or_(or_(or_(or_(x[12],or_(or_(x[20],or_(x[20],x[17])),not_(not_(and_(x[14],
or_(less_than_or_equal(x[0],mul(x[4],x[2])),x[17])))))),or_(not_(not_(x[17])),no
t_(greater_than_or_equal(x[3], x[3])))),or_(x[17],not_(not_(and_(x[14],or_(great
er_than_or_equal(mul(add(add(x[0],x[2]),mul(x[3],x[0])),sub(sub(x[4],x[2]),x[0])
), x[4]),x[9])))))),or_(or_(or_(or_(x[12],or_(or_(x[20],x[17]),x[17])),and_(x[14
],or_(less_than_or_equal(x[0],mul(x[4],x[2])),x[8]))),or_(and_(x[22],x[8]),or_(a
nd_(x[12],x[17]),or_(x[19],not_(greater_than_or_equal(x[3], x[3])))))),or_(x[17]
,not_(not_(and_(x[14],and_(x[21],not_(and_(x[14],or_(less_than_or_equal(x[0],mul
(x[3],x[3])),or_(x[12],or_(x[12],x[17])))))))))))))

Training Fitness:  0.2234615384615385
Depth:  17
Length of the genome:  559
Used portion of the genome: 0.29


Define a function to predict values, without comparing to expected outputs.

In [64]:
def predict(individual, X):
    x = X  # the phenotype refers to the data as x

    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
        print(type(pred))
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # Same threshold as in fitness_eval: positive (or True) means class 1
    predictions = (pred > 0).astype(int)

    return predictions
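The evaluation step can also be tried outside GRAPE. The sketch below evaluates a hand-written phenotype string against an explicit namespace instead of the notebook's globals; the NumPy functions are stand-ins assumed to behave like the functions imported from `functions.py`, and the data is made up.

```python
import numpy as np

# Hand-written phenotype in the style produced by the grammar
phenotype = "not_(or_(x[1], less_than_or_equal(x[0], x[2])))"

# Only these names are visible to the expression
namespace = {
    "x": np.array([[0.5, -1.0, 2.0],   # feature 0 for 3 samples
                   [0.0,  1.0, 0.0],   # feature 1
                   [0.0,  0.0, 3.0]]), # feature 2
    "not_": np.logical_not,
    "or_": np.logical_or,
    "less_than_or_equal": np.less_equal,
}

pred = eval(phenotype, {"__builtins__": {}}, namespace)
y_pred = pred.astype(int)
print(y_pred)
```

Passing the namespace explicitly makes it clear which functions a phenotype may call, at the cost of listing them by hand.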

Predict the classes of the test set.

Make sure the predictions printed here, in the notebook you submit to Brightspace, are the same ones you used in your best submission to the Kaggle competition.

In [65]:
#X_test = np.transpose(X_test)
print(X_test.shape)
y_pred = predict(hof.items[0], X_test)
print("Predicted classes of the test set: ", y_pred)



(25, 10402)
<class 'numpy.ndarray'>
Predicted classes of the test set:  [0 0 0 ... 1 0 0]


Write code to create a .csv file with the following format:
1. The first column is the index (from 0 to 10401);
2. The second column is named `income` and contains the predictions (only 0's or 1's) you got in the previous cell with y_pred.

Example:

    index,income

    0,0

    1,0

    2,1

    ...

    10401,0


Submit it to the competition and check your score there.

In [70]:
data = {
    'index': range(len(y_pred)),  # from 0 to 10401 for this test set
    'income': y_pred
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('predictions.csv', index=False)

# df_train_new = pd.read_csv("predictions.csv")
# print(df_train_new.shape)
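Before uploading, it may help to read the file back and check it against the format described above. The following is a minimal sketch on a toy file written to a temporary directory; the `check_submission` helper is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def check_submission(csv_path, n_rows):
    """Return True if the file matches the expected submission layout."""
    df = pd.read_csv(csv_path)
    return (
        list(df.columns) == ["index", "income"]
        and len(df) == n_rows
        and df["index"].tolist() == list(range(n_rows))
        and set(df["income"].unique()) <= {0, 1}
    )


# Toy predictions standing in for y_pred
toy_pred = np.array([0, 0, 1, 0])
toy_path = os.path.join(tempfile.mkdtemp(), "toy_predictions.csv")
pd.DataFrame({"index": range(len(toy_pred)), "income": toy_pred}).to_csv(toy_path, index=False)

print(check_submission(toy_path, n_rows=len(toy_pred)))
```

For the real file, the call would be `check_submission('predictions.csv', n_rows=10402)`.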


## Genetic Programming

If you do not want to use GE, you can use GP in this project.

In [None]:
from deap import gp
from deap import algorithms
#base, creator and tools were already imported

Set GP parameters.

In [None]:
POPULATION_SIZE =
P_CROSSOVER =
P_MUTATION =
MAX_GENERATIONS =
HALL_OF_FAME_SIZE =

RANDOM_SEED =
random.seed(RANDOM_SEED)

MIN_TREE_HEIGHT =
MAX_TREE_HEIGHT =
LIMIT_TREE_HEIGHT =
MUT_MIN_TREE_HEIGHT =
MUT_MAX_TREE_HEIGHT =

Next steps:

1. Define a fitness function;
2. Add functions and terminals to `gp.PrimitiveSet`. You may need to define some functions before that;
3. Create the toolbox;
4. Run GP;
5. Define a function to predict samples using a GP individual;
6. Predict the test set, save a .csv file and submit it to the Kaggle competition to see if your results are better than those from people using GE.

Hint:
    
    You may need to change the format of the data in X_train and X_test.
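As an illustration of the hint, a compiled GP individual is typically a callable that takes one argument per feature. The sketch below uses a hand-written function as a stand-in for such a callable (the function name, its two arguments and the data are made-up assumptions) and shows that a matrix with features in the rows can be unpacked straight into it.

```python
import numpy as np


def compiled_individual(age, hours):
    """Stand-in for the callable a compiled GP tree would provide."""
    return np.greater_equal(age + hours, 0.0)


# Features in rows, samples in columns (values are made up)
X = np.array([[-1.0,  0.5, 2.0],
              [ 0.2, -1.0, 0.1]])

# Unpacking passes each row (one feature over all samples) as an argument
pred = compiled_individual(*X)
y_pred = pred.astype(int)
print(y_pred)
```

If the data has samples in the rows instead, either transpose it first or call the function once per row.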

In [None]:
# your code here

Clean your notebook!

If you are using GE, remove the cells for GP, and vice-versa.