<img src="images for report/unicamp.png" width="150" height="150">

# Introduction to Artificial Intelligence - MO416A

# Genetic Algorithm for Feature Selection

This work was completed by the following members:



*   Aissa Hadj - 265189
*   Lucas Zanco Ladeira - 188951
*   Matheus Ferraroni - 212142
*   Maria Vitória Rodrigues Oliveira - 262884
*   Oscar Ciceri - 164786

The original code of the project is available in a [GitHub repository](https://github.com/lucaslzl/ga_ia_p2), and the video showing the search strategies at work is on [YouTube](https://youtube.com).



# I - Introduction

The problem tackled in this project is Feature Selection. The goal is to obtain the subset of the available features in a dataset that improves model performance by increasing its accuracy and decreasing its error rate. When a dataset contains irrelevant features, training requires more processing and memory, which wastes computing resources. The main benefits of feature selection are:

- Reducing Overfitting
- Improving the model Accuracy
- Reducing Training Time

Feature Selection may be done manually or automatically. Some manual techniques include univariate selection, feature importance, and the correlation matrix. The objective of the Univariate selection method is to statistically describe the relationship between each feature and the target. Also, feature importance generates a score for each feature to rank it. For instance, Decision Tree algorithms may rank features according to Gini impurity tests. Finally, the Correlation Matrix shows the correlation between pairs of features so that the features with a high correlation could be removed. The literature presents the usage of optimization techniques to automatically find the best (or a quite good) subset of features. Some of the methods include:

- <b>Exhaustive search</b>
- <b>Simulated Annealing</b>
- <b>Transformation Graph</b>
- <b>Genetic Algorithms</b>

<b>Exhaustive Search</b> is not an optimization technique, but it is worth mentioning as a baseline: it tries every possible subset of features to find the best one, so its computational complexity is $O(2^n)$ for $n$ features, which makes it impractical in most cases. <b>Simulated Annealing</b> is a metaheuristic for complex nonlinear optimization problems that is analogous to the simulation of the annealing of solids. The analogy pairs are as follows: feasible solution (state), cost (energy), optimal solution (ground state), local search (rapid quenching), simulated annealing (careful annealing). <b>Transformation Graph</b>, on the other hand, is a strategy that uses a tree-like structure to generate possible solutions. First, $n$ solutions are generated, each by removing one distinct feature. Second, all the solutions are evaluated. Third, the best solution is chosen and $n-1$ new solutions are generated from it by removing each feature not yet removed. The strategy continues until a budget is exhausted. Its drawback is that it requires a substantial amount of memory. Finally, a <b>Genetic Algorithm</b> is inspired by genetics (DNA) to search through solutions. Its process can be described with the following steps: (1) it generates a population of DNA variations; (2) it ranks the population according to some score; (3) it applies mutations and other transformations to the DNA to produce the next generation; and (4) it repeats steps 2 and 3 until it reaches a stopping rule. The stopping rule may be a number of generations or reaching a target solution.

In this project, we decided to tackle only the Genetic Algorithm strategy, as it already covers a lot of ground. That said, this report is structured as follows: in Section 2, we describe the implementation of the main parts of the genetic algorithm; in Section 3, we discuss the methodology we followed to undertake this project; in Section 4, we analyze the results in detail; finally, in Section 5, we present the conclusion.

# II - Genetic Algorithm

We developed 2 classes to control the operation of the genetic algorithm. The classes are 'element' and 'GeneticAlgorithm' and can be seen in https://github.com/lucaslzl/ga_ia_p2/blob/master/GA.py. In this notebook, the class 'element' will be explained, as well as the main methods of the class 'GeneticAlgorithm'.

Below, the entire class 'element' is presented. This class is only responsible for holding the id, the generation, the genome, and the score of each element of the population for every generation. Keeping these attributes in one place is useful for different approaches during the implementation and use of the genetic algorithm's methods. It would also be possible to save the parents of each element and trace back how each element was formed during the evolution.

In [None]:
class element:

    def __init__(self, idd, geracao, genome):
        self.idd = idd
        self.geracao = geracao
        self.genome = genome
        self.score = None


    def __repr__(self):
        return "(id="+str(self.idd)+",geracao="+str(self.geracao)+",score="+str(self.score)+")"

The genetic algorithm implemented is very generic, which means that it can easily be applied to different problems. This kind of generic implementation makes it practical to override the functions responsible for generating a random genome, mutating, and calculating the fitness.

The method 'create_initial_population' is called once the genetic algorithm is launched. This method is responsible for creating the individuals of the initial population and adding them to the pool until the required population size is reached. To create the genome, the function 'random_genome', which was overridden beforehand, is called for each element.

In our problem, we define the genome as an array of 0's and 1's whose length equals the number of features in the dataset. This representation allows us to decode the genome as follows:

- Every bit of the genome corresponds to a feature in the dataset
- If the bit is 1: the feature corresponding to that bit is used
- If the bit is 0: the feature corresponding to that bit isn't used
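
The decoding rule above can be illustrated with a minimal sketch. The toy DataFrame and its column names are hypothetical and only serve the example; the report's own code applies the same kind of boolean mask inside its "evaluate" function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with four feature columns
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [0, 1, 0],
})

genome = np.array([1, 0, 1, 0])   # keep "a" and "c", drop "b" and "d"
mask = genome.astype(bool)        # boolean mask aligned with df.columns
selected = df.loc[:, mask]        # columns whose bit is 1

print(list(selected.columns))     # ['a', 'c']
```

Because the mask is positional, the genome length must equal the number of feature columns, with the target column excluded.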

In [None]:
def create_initial_population(self):
    for _ in range(self.population_size):
        self.population.append(element(self.elements_created, 0, self.random_genome()))
        self.elements_created += 1


def random_genome():
    return np.random.randint(low=0,high=2,size=len(df.columns),dtype=int)

The method 'run' is where the main loop of the genetic algorithm is executed. The steps are:

1. Check the stop criteria
2. Calculate the score for the current population
3. Sort the population according to the fitness value
4. Update the solution if a better result was found
5. Save the log
6. If set, the worst half of the population is discarded (this option is not used in the solutions found for this work)
7. Create a new population

In [None]:
def run(self):

    while self.check_stop():
        self.calculate_score() 
        self.population.sort(key=lambda x: x.score, reverse=True) 

        if self.best_element_total is None or self.population[0].score > self.best_element_total.score: 
            self.best_element_total = self.population[0]

        self.do_log()

        if self.cut_half_population: 
            self.population = self.population[0:len(self.population)//2] 

        self.new_population()

        self.iteration_counter +=1

    return self.best_element_total

The creation of a new population is implemented so that it is independent of how the parents are selected. To achieve this, we build an array of selection probabilities that encodes the selection rule: either every element has an equal chance, or selection is by roulette, where elements with higher fitness have higher chances of being selected.

The crossover method receives the genomes of both parents and returns a new genome based on them.
This new genome is placed in a new element, which is then mutated and added to the new pool.

If we are recreating the entire population, this process repeats until the new population reaches the population size limit. If we are replicating a percentage of the best elements, the number of children created is reduced by the number of replicated elements, and the best elements are appended after the children have been created.

In [None]:
def new_population(self):

    probs = self.get_probs()
    newPop = []
    best_replicator = int(self.population_size*self.replicate_best)

    while len(newPop)<self.population_size-best_replicator:
        parents = np.random.choice(self.population,size=2,p=probs) 

        if parents[0].score<parents[1].score: 
            parents = parents[::-1] 

        new_element = element(self.elements_created, self.iteration_counter, self.crossover(parents[0].genome, parents[1].genome))

        new_element.genome = self.active_mutate(new_element.genome)
        newPop.append(new_element)
        self.elements_created += 1

    for i in range(best_replicator):
        newPop.append(self.population[i])

    self.population = newPop

In order to define the probability of being selected as a parent, we implement 3 methods: "get_probs", "probs_equal" and "probs_roulette".

The "get_probs" just checks which kind of probability function must be used and calls the right one.
The "probs_equal" returns an array of probabilities where every element has the same chance of being selected.
The "probs_roulette" collects the fitness of each element in an array, sums it, and divides the array by this sum. The result is an array where the best solutions have higher chances of being selected.

In [None]:
def get_probs(self):
    if self.probs_type == 0:
        return self.probs_roulette()
    elif self.probs_type == 1:
        return self.probs_equal()


def probs_equal(self):
    return [1/len(self.population)]*len(self.population)


def probs_roulette(self):
    probs = [0]*len(self.population) 
    for i in range(len(probs)):
        probs[i] = self.population[i].score
    div = sum(probs)

    if div!=0:
        for i in range(len(probs)):
            probs[i] /= div
    else: 
        probs = self.probs_equal()
    return probs
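
As a worked illustration of roulette selection, the sketch below normalizes a small list of scores and draws two parents. The function name `roulette_probs` and the four scores are made up for the example; it is a vectorized variant of the idea, not the report's exact code.

```python
import numpy as np

def roulette_probs(scores):
    """Turn non-negative fitness scores into selection probabilities."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    if total == 0:
        # Same fallback as in the report: equal chance for every element
        return np.full(len(scores), 1.0 / len(scores))
    return scores / total

# Hypothetical fitness values of a four-element population
probs = roulette_probs([0.9, 0.6, 0.3, 0.2])
print(probs)  # approximately [0.45, 0.30, 0.15, 0.10]

# Draw the indices of two parents according to these probabilities
rng = np.random.default_rng(0)
parent_idx = rng.choice(len(probs), size=2, p=probs)
```

Note that drawing with replacement (the default) can return the same element twice; passing `replace=False` would force two distinct parents.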

The method "check_stop" is responsible for checking which stopping criterion must be used. Because "run" loops while "check_stop" returns "True", each criterion returns "True" while the search should continue and "False" once it should stop. The stopping criterion can be used in 3 different ways.

The "stop_criteria_iteration" returns "True" while the iteration limit has not been reached.
The "stop_criteria_score" returns "True" while the best solution found so far is below the target score.
The "stop_criteria_double" is a mix of the previous 2 methods: the search continues only while both of them return "True".

In [None]:
def check_stop(self):
    if self.stop_criteria_type==0:
        return self.stop_criteria_double()
    elif self.stop_criteria_type==1:
        return self.stop_criteria_iteration()
    elif self.stop_criteria_type==2:
        return self.stop_criteria_score()

def stop_criteria_double(self):
    # Continue only while the limit is not reached AND the target score is not met
    return self.stop_criteria_iteration() and self.stop_criteria_score()

def stop_criteria_iteration(self):
    return self.iteration_counter<self.iteration_limit

def stop_criteria_score(self):
    # Use the best element seen so far: after new_population() the first
    # elements of self.population are unscored children (score is None)
    best = self.best_element_total
    s = 0 if best is None or best.score is None else best.score
    return s<self.max_possible_score

We implemented 4 different crossovers and 1 function to decide which one must be used. The function "crossover" just receives the genomes of 2 parents and calls the selected crossover function.

The "crossover_rate_selection" iterates over the genomes of both parents and decides, bit by bit, which parent the bit is taken from. The decision uses a rate that was defined previously. This means that the resulting genome can be, for example, about 80% from one parent and 20% from the other, where the 80% comes from the parent with the higher score.

The "crossover_uniform" picks each bit from either parent with a 50%/50% chance.

The "crossover_single_point" draws a random point inside the genome and takes the first part from parentA and the second part from parentB.

The "crossover_two_point" draws 2 distinct random points in the genome and concatenates a part of genomeA, a part of genomeB, and a part of genomeA, split at those points.

In [None]:
def crossover(self, genA, genB):
    if self.crossover_type==0:
        return self.crossover_uniform(genA, genB)
    elif self.crossover_type==1:
        return self.crossover_single_point(genA, genB)
    elif self.crossover_type==2:
        return self.crossover_two_point(genA, genB)
    elif self.crossover_type==3:
        return self.crossover_rate_selection(genA, genB)

def crossover_rate_selection(self, genA, genB):
    new = np.array([],dtype=int)
    for i in range(len(genA)):
        if np.random.random()<self.crossover_rate:
            new = np.append(new, genA[i])
        else:
            new = np.append(new, genB[i])
    return new


def crossover_uniform(self, genA, genB):
    new = np.array([],dtype=int)
    for i in range(len(genA)):
        if np.random.random()<0.5:
            new = np.append(new, genA[i])
        else:
            new = np.append(new, genB[i])
    return new


def crossover_single_point(self, genA, genB):
    p = np.random.randint(low=1,high=len(genA)-1) 
    return np.append(genA[0:p],genB[p:])


def crossover_two_point(self, genA, genB):
    c1 = c2 = np.random.randint(low=0,high=len(genA)) 
    while c2==c1: 
        c2 = np.random.randint(low=0,high=len(genA))

    if c1>c2: 
        c1, c2 = c2,c1

    new = np.append(np.append(genA[0:c1],genB[c1:c2]),genA[c2:]) 
    return new
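
To make the crossover operators concrete, here is a small sketch on two easily distinguishable parents (all ones versus all zeros). The helper names and the fixed cut points are illustrative assumptions; the rate-based variant is a vectorized equivalent of the bit-by-bit loop, not the report's exact code.

```python
import numpy as np

gen_a = np.ones(8, dtype=int)    # parent A: all ones
gen_b = np.zeros(8, dtype=int)   # parent B: all zeros

def rate_crossover(a, b, rate, rng):
    """Take each bit from `a` with probability `rate`, otherwise from `b`."""
    take_a = rng.random(len(a)) < rate
    return np.where(take_a, a, b)

def two_point_crossover(a, b, c1, c2):
    """Keep `a` outside [c1, c2) and use `b` inside it."""
    return np.concatenate([a[:c1], b[c1:c2], a[c2:]])

rng = np.random.default_rng(42)
child_rate = rate_crossover(gen_a, gen_b, 0.8, rng)   # roughly 80% ones
child_two = two_point_crossover(gen_a, gen_b, 2, 5)

print(child_two)  # [1 1 0 0 0 1 1 1]
```

With a rate of 0.5, `rate_crossover` behaves like the uniform crossover.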

The method "calculate_score" was implemented to run either sequentially or with one thread per element, and the user can choose how to execute it before the main loop starts.

This function just passes the genome of each element to a function that returns its fitness. In our case, we call the function "evaluate", which decodes the genome to decide which features of the dataset to use and checks how well these features perform.

In [None]:
def calculate_score(self):
    if self.use_threads: 

        threads_running = []
        for e in self.population:
            x = threading.Thread(target=self.thread_evaluate, args=(e,))
            x.start()
            threads_running.append(x)

        for i in range(len(threads_running)):
            threads_running[i].join()

    else: 
        for e in self.population:
            e.score = self.evaluate(e.genome)

def thread_evaluate(self, e):
    e.score = self.evaluate(e.genome)
    

def evaluate(genome):
    bool_genome = list(map(bool, genome))
    return model.evaluate(df.loc[:, bool_genome].copy(), target)

The mutation method is called "active_mutate" and receives a single genome. This method iterates through the entire genome and draws a random value for each index; if this value is smaller than the mutation rate set at initialization, a mutation is triggered at that index.

We implemented 2 different mutations for this project:

- mutate1: This method implements the generative mutation, which randomly changes a gene. The genes have binary values; thus, the selected gene's allele is replaced by its complement.


- mutate2: This method implements the sequence swap combined with a generative mutation. First, a random position on the chromosome is selected. The genes located after this position are moved to the beginning of the chromosome, and the genes located before it are moved to the end. The generative mutation is then applied to the new chromosome.

In [None]:
def active_mutate(self,gen):
    if self.mutation_rate<=0: 
        return gen
    for i in range(len(gen)): 
        if np.random.random()<self.mutation_rate: 
            gen = self.mutate1(i, gen) 
    return gen


def mutate1(index, genome):
    if genome[index]==0:
        genome[index] = 1
    else:
        genome[index] = 0
    return genome


def mutate2(index, genome):
    aux = []
    for i in range(len(genome)):
        if i <= index:
            aux.append(genome[i])
        else:
            aux.insert(0, genome[i])
    genome = aux
    if genome[index]==0:
        genome[index] = 1
    else:
        genome[index] = 0
    return genome
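
The per-gene mutation loop combined with the generative (bit-flip) mutation can also be written in vectorized form. The sketch below is an illustrative equivalent under that assumption; the name `bit_flip_mutation` and the example genome are hypothetical.

```python
import numpy as np

def bit_flip_mutation(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    genome = np.asarray(genome)
    flip = rng.random(len(genome)) < rate      # one random draw per gene
    return np.where(flip, 1 - genome, genome)  # complement the selected genes

rng = np.random.default_rng(1)
genome = np.array([1, 0, 1, 1, 0, 0, 1, 0])
mutated = bit_flip_mutation(genome, 0.15, rng)

# Number of genes that actually changed in this draw
n_changed = int((mutated != genome).sum())
```

With $n$ genes and a rate $r$, the expected number of flipped genes per individual is $n \cdot r$; for a 34-feature genome this gives about 1 flip at $r = 0.03$ and about 5 flips at $r = 0.15$.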

# III - Methodology

As already mentioned, in this project we apply a genetic algorithm (GA) to select a subset of features. The selection may help a data scientist analyze data and improve the efficiency of machine learning models. To evaluate the GA strategy and each member of the population, a supervised learning model was chosen. The model has a <i>target</i>, which is the feature of the dataset to be predicted. Supervised learning models may be divided into two groups: classification, which predicts a categorical feature (class), and regression, which predicts a numerical value that does not describe a category. In this work, we use tabular classification datasets, since a large number of them are openly available and well-known, easy-to-use pre-processing techniques exist for them. For example, one dataset requires classifying mushrooms according to their characteristics, so each type of mushroom corresponds to a distinct class. With the datasets and the model fixed, the GA parameters can be varied to find the parameters that yield a good subset of features. To summarize, the methodology covers the data characteristics, the datasets, the classification model, the parameters and configurations (groups of parameters), and the metrics.

## 1 - Tabular Data

We use the term "tabular data" to designate datasets composed of features that take numerical values (for instance, the income of a client with a checking account at a bank) or categorical values (for example, "Yes" or "No" values describing whether a client is a member of an airline's mileage program). We chose to use only tabular data for practical reasons, such as:
- There is a huge number of datasets available at [Kaggle](https://www.kaggle.com/), [OpenML](https://www.openml.org/), [UCI](https://archive.ics.uci.edu/ml/datasets.php).
- They are relatively easy to pre-process and clean.
- There are packages (i.e., Pandas) that are well suited to work with tabular data.
- In the literature they are the focus of the feature selection task.

## 2 -  Datasets

Bearing in mind that the datasets must be of a reasonable size (up to about 200MB) because of the limited computing power at our disposal, that the focus is on classification problems, and that the features must be either numerical or categorical, a total of 9 different datasets were selected. The majority of them come from the competition website Kaggle. The number of features in each dataset varies from 9 to over 500, which helps us evaluate the computing resources needed by the genetic algorithm and how practical it is to apply in the real world. In particular, as we will see in more detail in the results section, as the number of features in a dataset increases, the number of observations needed to build a relatively "good" machine learning model also increases. This issue is known as "the curse of dimensionality". Varying the number of features matters because it has a large impact on the time and computing power the genetic algorithm needs. The exact figures are shown in the table below.

Dataset name | Number of features | All possible feature combinations | Number of selected features with GA
--------- | --------- | --------- | --------- 
Glass | 9 | 512 | 5
Cellphone | 20 | 1,048,576 | 10
Mushrooms | 22 | 4,194,304 | 10
Airline Customer Satisfaction | 22 | 4,194,304 | 7
Kobe | 24 | 16,777,216 | 9
Flag | 29 | 536,870,912 | 16
IBM | 34 | 17,179,869,184 | 17
Band | 37 | 137,438,953,472 | 23
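
The "all possible feature combinations" column follows from the fact that each feature is either kept or dropped, which gives $2^n$ subsets for $n$ features. A short sketch that recomputes the column from the feature counts listed in the table:

```python
# Feature counts taken from the table above
feature_counts = {
    "Glass": 9,
    "Cellphone": 20,
    "Mushrooms": 22,
    "Airline Customer Satisfaction": 22,
    "Kobe": 24,
    "Flag": 29,
    "IBM": 34,
    "Band": 37,
}

# Each feature is either kept or dropped, so n features give 2**n subsets
search_space = {name: 2 ** n for name, n in feature_counts.items()}

for name, size in search_space.items():
    print(f"{name:<30} {size:>20,}")
```

For comparison, a GA run with a population of 50 and 100 generations scores at most 50 × 100 = 5,000 genomes, a tiny fraction of the search space for all but the smallest dataset.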


## 3 - Classification Model

As the Python language was used in this work, some libraries that already implement machine learning models may be mentioned:
- [Scikit-learn](https://scikit-learn.org/stable/).
- [TensorFlow](https://www.tensorflow.org/).
- [PyTorch](https://pytorch.org/).

Many classification models are available in these libraries, such as Neural Networks, Decision Trees, Support Vector Machines, and more. The Genetic Algorithm has to train and test a model for each member of the population within each generation. This makes it necessary to choose a lightweight model that fits the available time and computing power. Neural Networks, although they usually achieve good results, demand a large amount of computing power to train. A novel area called TinyML may be useful to improve the computational efficiency of Neural Network models; however, it is out of the scope of this work. Support Vector Machines also require a large amount of computing power, as they try to map data to high-dimensional hyperplanes that separate the data into classes. Two models with moderate computing power requirements may be mentioned: Decision Tree and Random Forest. A Decision Tree builds, as the name implies, a decision tree ordered by a feature evaluation function. Usually Gini impurity is used as that function, which identifies the features that best separate the data.

The figure below is a good example of how a Decision Tree works. "Is the income over or below $30,000?" is, for instance, a test performed on the feature called "Income". Each branch represents the outcome of the test, and each leaf node represents a class label. The predicted class is decided after performing a series of tests on features, until a leaf node is reached. The paths from the root to the leaves represent classification rules.

<img width="450" height="450" src="images for report/decision tree example.png">

The Random Forest model uses several Decision Trees to classify a target. The classification works like an election, using equal or distinct weights for each tree. Each Decision Tree is generated from a random subset of features. The problem with using this model for the Feature Selection task is that some features may not be used in any tree at all. Also, it is harder to find which features best describe the classes, as they are randomly selected each time. That is one of the reasons Decision Tree was chosen instead of Random Forest. Further advantages of this model are:
- Lightweight model.
- It is easy to explain the decision tree generated.
- The evaluation function is already used for manual Feature Selection.

However, we have to be aware of some cons too:
- As it is a simpler model, it may not achieve the expected result.
- A large number of features will require a large tree.

## 4 - Parameters

To evaluate the solutions to each modelling problem, we apply various configurations of the following parameters:

- **Population**: It is the total number of individuals in each generation. Its value is either 10 or 50 at each generation.


- **Iteration limit**: It is the maximum number of iterations. In this problem, the chosen values vary from 100 to 200. 


- **Stopping criterion**: It is the criterion that stops the genetic algorithm. The first stopping criterion examines both the score of individuals (*) and the iteration limit. The second criterion stops the algorithm once it reaches the maximum iteration limit. 


- **Crossover type**: Two types of recombination were employed: the two-point crossover and the crossover driven by the crossover rate (rate selection).


- **Crossover rate**: It defines the share of genes taken from the higher-scoring parent in the rate-selection crossover. The chosen values are 0.8 and 0.5. 


- **Mutation type**: There are two mutation types: the generative method, which inverts the value of a gene, and the sequence swap method combined with the generative method.


- **Mutation rate**: Probability that each gene of a new individual is mutated. The chosen values are 0.03 and 0.15.


- **Replicate best individuals**: It is the percentage of the best individuals that are replicated for the next generation. In this case, it can be 10% or 0%. In the latter case, the entire population is replaced (the "extermination" method). 



(*) The score of individuals is equal to the F1 weighted score. It’s the evaluation metric we use to evaluate each classification model, and thus each individual from the population for all generations. The score varies between 0 and 1. The higher the value, the better the classification model. For more details about the F1 weighted score, please visit this page: 

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Also, the Cross Validation strategy with KFold (K=10) is used.
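
As a hedged sketch of how such a fitness value could be computed with scikit-learn, the function below scores a genome with a Decision Tree, the weighted F1 score, and 10-fold cross-validation. The Iris dataset, the function name `fitness`, the shuffling, and the fixed random seeds are assumptions made for the example; the report does not specify these details.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): 150 samples, 4 numerical features
X, y = load_iris(return_X_y=True)

def fitness(genome, X, y):
    """Mean weighted F1 of a decision tree over 10 folds, using only
    the features whose genome bit is 1."""
    mask = np.asarray(genome, dtype=bool)
    if not mask.any():
        return 0.0  # assumption: an empty feature set gets the worst score
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    model = DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(model, X[:, mask], y, cv=cv, scoring="f1_weighted")
    return float(scores.mean())

score_all = fitness([1, 1, 1, 1], X, y)     # baseline: every feature
score_subset = fitness([0, 0, 1, 1], X, y)  # only the last two features
```

Calling `fitness` with an all-ones genome gives the score of a model that uses every feature, which is a natural reference point for any selected subset.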

## 5 - Configurations

The GA has an initial configuration, as shown below:

**Initial configurations (C0):**

- Population: 50.
- Maximum number of generations: 100.
- Stopping criterion: Maximum number of generations.
- Crossover type: Crossover rate selection.
- Crossover rate: 80%.
- Mutation type: Generative.
- Mutation rate: 3%.
- Replicate best individuals: 10%.

The initial configuration is the basis for the performance evaluation of the GA. Moreover, we modified the parameters of this initial configuration to analyze the impact of these variations on the fitness value. Each configuration below corresponds to a change on a single parameter from the initial configuration. For simplicity, we present below only the variation from the initial configuration above:

**Configuration 1 (C1):** 
- Population: 10.

**Configuration 2 (C2):** 
- Maximum number of generations: 200.

**Configuration 3 (C3):**
- Stopping criterion: Maximum number of generations and fitness value greater than 90%.

**Configuration 4 (C4):** 
- Crossover type: Two crossover points.

**Configuration 5 (C5):** 
- Crossover rate: 50%.

**Configuration 6 (C6):**
- Mutation type: SeqSwap with generative.

**Configuration 7 (C7):**
- Mutation rate: 15%.

**Configuration 8 (C8):**
- Replicate best individuals: 0%. 
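
Since every configuration differs from C0 in exactly one parameter, the nine configurations can be expressed as a base dictionary plus one override each. The key names and string values below are hypothetical labels chosen for the sketch, not the parameter names used in the project's code.

```python
# Initial configuration (C0); key names are illustrative only
BASE = {
    "population_size": 50,
    "iteration_limit": 100,
    "stop_criterion": "iterations",
    "crossover_type": "rate_selection",
    "crossover_rate": 0.8,
    "mutation_type": "generative",
    "mutation_rate": 0.03,
    "replicate_best": 0.1,
}

# One changed parameter per configuration
OVERRIDES = {
    "C0": {},
    "C1": {"population_size": 10},
    "C2": {"iteration_limit": 200},
    "C3": {"stop_criterion": "iterations_and_score"},
    "C4": {"crossover_type": "two_point"},
    "C5": {"crossover_rate": 0.5},
    "C6": {"mutation_type": "seqswap_generative"},
    "C7": {"mutation_rate": 0.15},
    "C8": {"replicate_best": 0.0},
}

# Merge the base with each override to obtain the full configurations
CONFIGS = {name: {**BASE, **delta} for name, delta in OVERRIDES.items()}

print(CONFIGS["C5"]["crossover_rate"])  # 0.5
```

Keeping the overrides separate from the base makes it easy to check that any difference in results can be attributed to a single parameter.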




## 6 -  Metrics

The genetic algorithm (GA), which was configured in nine variations, has been executed for a set of datasets to improve the fitness value (the F1 weighted score). To compare the different configurations of parameters of the genetic algorithm, we measure the three following metrics:

- **Maximum Fitness Value:** This metric is equal to the fitness value of the chromosome which has the greatest aptitude among the population for a given generation. 


- **Average Fitness Value:** This metric is the mean value of the fitness value of all the chromosomes inside the population at each generation.  


- **Minimum Fitness Value:** This metric is equal to the fitness value of the chromosome, which has the lowest aptitude among the population for a given generation. 
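
These three metrics can be computed per generation from the list of fitness values of the population. A minimal sketch, with made-up scores for two generations:

```python
import numpy as np

def generation_stats(scores):
    """Maximum, average, and minimum fitness of one generation."""
    scores = np.asarray(scores, dtype=float)
    return {
        "max": float(scores.max()),
        "avg": float(scores.mean()),
        "min": float(scores.min()),
    }

# Hypothetical fitness values for two consecutive generations
generations = [
    [0.40, 0.50, 0.60],
    [0.55, 0.60, 0.65],
]
history = [generation_stats(s) for s in generations]

for i, stats in enumerate(history):
    print(i, stats)
```

A shrinking gap between the maximum and the average over the generations indicates that the population is converging and losing diversity.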


## 7 - Comparison with base machine learning model

In order to evaluate the feature selection provided by the genetic algorithm (GA), we compare the model trained on the GA-selected features to the base decision tree model that includes all the features available in each dataset. Since the objective of the project is to evaluate the genetic algorithm on the feature selection task, the machine learning model (decision tree) is fixed for all the datasets. For that reason, it would be unfair to compare the results given by the genetic algorithm to the final scores obtained by competitors on the Kaggle website, since they chose more complex machine learning models than the decision tree.


# IV - Simulation Results and Discussion


The figures presented in this section show the values obtained from simulations of the different GA configurations. We compare the configurations in terms of maximum, average, and minimum fitness values. Moreover, the best GA configuration for each dataset is also analyzed. 

Looking at the results from all the datasets, we first notice that it takes about 30 generations before the average fitness value of all the individuals stabilizes at a maximum value. This statement is true for all the different configurations. The genetic algorithm thus selects the best combination of features relatively fast. 


## Results overview


Before doing a detailed analysis of the genetic algorithm, we briefly show some of the best genomes found by the genetic algorithm for some datasets with different configurations. 


In each figure below, the numbers from 0 to 99, on the vertical axis, correspond to the generation number from bottom to top. On the horizontal axis, we list the features in the order they appear in the datasets (from left to right). A dark color corresponds to a feature that was included in the decision tree model by the genetic algorithm, for a given generation.

<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/genomeSequencePlots/glass.csv_50_100_0_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="400"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/genomeSequencePlots/bands.csv_50_100_1_0_3_0.8_0_0.15_False_False_0.1.json.png' width="450" height="400"></td></tr>

<tr>
<td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/genomeSequencePlots/IBM.csv_50_200_1_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="400"></td>
<td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/genomeSequencePlots/flag.csv_50_100_1_0_3_0.8_1_0.03_False_False_0.1.json.png' width="450" height="400"></td>
</tr>
</table>

On the top row, the left plot shows the Glass dataset, where we try to predict the glass type from the features. The right plot shows the best genome at each generation for the Bands dataset, for a given configuration of parameters.
On the second row, the left image shows the best genome at each generation for predicting whether an employee will leave the company (IBM dataset). In the right image, we classify flags by region.



## IBM Dataset

The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the IBM dataset are shown in the following figures.

<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/IBM_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/IBM_avg.png' width="450" height="300"></td></tr></table>


Simulation results show that configuration 5 (C5) produces the highest fitness value among all the configurations. Its maximum fitness value, however, jumps sharply around generation 30, which means that a chromosome well above the population average was generated at that point.

Moreover, the C0 and C3 configurations produce high average fitness values compared to the other configurations. C0 and C3 behave identically because the stopping criterion is the only parameter that differs between them, and it does not affect the fitness values. C6, in turn, produces one of the least diverse populations across all generations: its average value is just 3% below the maximum value of its best individual.

The maximum, average and minimum values for the best configuration (C5) of the GA on the IBM dataset are shown in the following figure.

<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/IBM_best_config.png" width="450" height="300">

The C5 configuration produces a fitness value of 48%, which is 20% higher than the lowest fitness value. The average fitness values are constant from generation 40 onward, which means we could have stopped the algorithm earlier and avoided wasting computing resources. Furthermore, a maximum fitness value of 48% across all configurations possibly means that the GA gets stuck in a local maximum. This can occur when the best individuals monopolize the selection, which is a consequence of inadequate normalization.


<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/IBM.csv_50_100_1_0_3_0.5_0_0.03_False_False_0.1.json.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/IBM.csv_50_100_1_0_3_0.8_1_0.03_False_False_0.1.json.png' width="450" height="300"></td></tr></table>

On the left image, we present the best chromosome at each generation with the C5 configuration. On the right image, we show the least "diverse" best genome sequence with the C6 configuration.


## Bands Dataset

The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the bands dataset are shown in the following figures.

<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bands_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bands_avg.png' width="450" height="300"></td></tr></table>


Simulation results show that the C3 configuration has the highest fitness within 100 iterations. However, C2 has a limit of 200 iterations and exceeds the fitness values of all other configurations after 175 iterations. Moreover, the C0 and C6 configurations obtained lower fitness values than the other configurations. This is because C0 has a smaller population, just 10 individuals, and C6 uses sequential swap mutation with generation, which causes less diversity between individuals.

The maximum, average and minimum values for the best configuration (C2) of the GA in the Bands dataset are shown in the following figure.

<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bands_best_config.png" width="450" height="300">


With the C2 configuration, the maximum fitness value for the Bands dataset is 0.85 (85%) and the minimum is 0.64 (64%). In addition, the average values begin to stabilize after 50 iterations.




<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/bands.csv_50_100_0_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/bands.csv_50_200_1_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="300"></td></tr></table>


On the left image above, we present the best genome with the C3 configuration. On the right image, we show the best genome sequence with the C2 configuration, containing 200 generations.


## Flag Dataset

The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the flag dataset are shown in the following figures.

<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/flag_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/flag_avg.png' width="450" height="300"></td></tr></table>



Simulation results show that configuration 2 (C2) produces the highest fitness value compared to the other configurations. However, this fitness value is only slightly higher than those produced by the other configurations: in the best case, C2 produces a fitness 3% higher than the worst configuration (C8). Thus, all configurations produce almost the same results. Configuration C8 nevertheless produces the lowest average fitness values, since its new population is created by the extermination method, which increases the variability of the best chromosome.

The maximum, average and minimum values for the best configuration (C2) of the GA in the Flag dataset are shown in the following figure.

<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/flag_best_b_a_m.png" width="450" height="300">



The C2 configuration produces a high diversity in the chromosomes, which is visible in the chromosomes with the lowest fitness value: their fitness values vary from 30% to 70% across all generations. Moreover, the average fitness value converges after 15 generations. The fitness value of the best chromosome also increases the most up to that generation. After generation 15, the increase in the fitness value is approximately 3%.



On the plot below, we show the best genome sequence with the C2 configuration, containing 200 generations.

<img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/flag.csv_50_200_1_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="300">




## Cellphone Dataset


The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the Cellphone dataset are shown in the following figures.


<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/cellphone_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/cellphone_avg.png' width="450" height="300"></td></tr></table>



It is interesting to note that in the Cellphone dataset, the maximum fitness values for all configurations were very similar. Moreover, the average fitness of the C6 and C7 configurations was lower than that of the others. However, they all stabilized after around 20 iterations. The C3 configuration showed the best behavior.

The maximum, average and minimum values for the best configuration (C3) of the GA in the Cellphone dataset are shown in the following figure.


<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/cellphone_best_config.png" width="450" height="300">


The C3 configuration produces the highest fitness value using the stop criterion Maximum Number of Generations and a fitness value greater than 90%. In this configuration, the fitness values of the worst chromosomes vary from 30% to 60%, without stabilizing. Despite that, the average fitness values start to stabilize after 10 iterations.


The image below presents the best chromosome with the C3 configuration at each generation.

<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/cellphone.csv_50_100_0_0_3_0.8_0_0.03_False_False_0.1.json.png" width="450" height="300">



## Mushrooms Dataset


The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the mushrooms dataset are shown in the following figures.


<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/mushrooms_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/mushrooms_avg.png' width="450" height="300"></td></tr></table>

The GA, in all configurations, reaches the maximum fitness value after a few generations, which means the dataset has a low complexity. Moreover, the stop criteria employed in the configurations wasted considerable computational resources, since the genetic algorithm could have stopped after a few iterations.


The image below presents the best chromosome with the C3 configuration at each generation.

<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/mushrooms.csv_50_100_0_0_3_0.8_0_0.03_False_False_0.1.json.png" width="450" height="300">


## Airline Customer Satisfaction Dataset


The fitness values of the best chromosomes and the average fitness value of all chromosomes for each GA configuration employing the Airline Customer Satisfaction dataset are shown in the following figures.

<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/airline_customer_satisfaction_max.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/airline_customer_satisfaction_avg.png' width="450" height="300"></td></tr></table>

The genetic algorithm produced relatively good results on the “Airline Customer Satisfaction” dataset for all parameter combinations. In the figure above titled “Airline Customer Satisfaction - Maximum”, all the scenarios generate a “best” chromosome with a fitness value above 0.94 from generation 20 (except for scenarios C5 and C7, where we need to wait until generation 30). Scenario C5 produces the “least good” chromosome, with a final fitness value of 94%. With a crossover rate of 50%, the new chromosome is composed 50-50 from the chromosomes of the two parents, which turns out not to be a good strategy here.

If any slight increase in fitness value is important, then C3 produces the best chromosome among the best chromosomes from all the scenarios, with a fitness value of about 94.5% starting at generation 65. The strategy of taking 80% of the gene sequence from the best parent and 20% from the other parent to produce the new chromosome was the best one, which suggests that a few specific features carried by the best parent explain most of the classification of the target.

We also note the impact of randomness on the results. C0 and C2 use the same strategy as C3, so the differences in their fitness values are not caused by how new generations are created but by the randomness in the mutation and crossover processes. C2 and C4 eventually produce chromosomes reaching the same fitness level as C3, from generations 60 and 90 respectively.
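The 80/20 and 50/50 splits described above can be pictured with the following sketch. It shows one possible reading of the crossover rate, a single cut point placed at that share of the genome. This is an assumption made for illustration, and the function name `crossover` and the sample genomes are hypothetical; the crossover operator actually used by the project may differ.

```python
def crossover(best_parent, other_parent, rate=0.8):
    """Build a child from the first `rate` share of best_parent's genes
    followed by the remaining genes of other_parent (single cut point)."""
    if len(best_parent) != len(other_parent):
        raise ValueError("parents must have the same genome length")
    cut = round(len(best_parent) * rate)
    return best_parent[:cut] + other_parent[cut:]


# Hypothetical 10-gene parents (1 = feature kept, 0 = feature dropped).
best = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
other = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

print(crossover(best, other, rate=0.8))  # 8 genes from best, 2 from other
print(crossover(best, other, rate=0.5))  # 5 genes from each parent
```

With a rate of 0.8, most of the feature choices of the fitter parent survive into the child, which matches the intuition given in the text.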



<table><tr><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/airline_customer_satisfaction.csv_50_100_0_0_3_0.8_0_0.03_False_False_0.1.json.png' width="450" height="300"></td><td><img src='https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/bestgenomePlots/airline_customer_satisfaction.csv_50_100_1_0_3_0.5_0_0.03_False_False_0.1.json.png' width="450" height="300"></td></tr></table>


On the left image above, we present the best genome with the C3 configuration. On the right image, we show the best genome sequence with the C5 configuration, giving the least good fitness value.


## Result summarization according to the F1-score

In this section, we compare the results for each dataset without selection and with selection (according to each distinct configuration). We define “without selection” as building the decision tree model with all the features present in the datasets. The figure below shows the histograms of fitness values “without selection” against the results discussed earlier. For the “IBM” dataset, removing some features decreases the quality of the model. For the “Airline Customer Satisfaction” problem, feeding all the features or just the features selected by the GA makes a very small difference (at most 1% in the F1 score). For all the other datasets, however, feeding all the features to the decision tree model isn’t a good idea: feature selection by the genetic algorithm improved the models’ scores by 1% to more than 10%. This underlines the capability of the strategy to select the best features from the datasets to classify the target. Even when only a slight improvement is gained, notably for the “Cellphone” dataset, removing features has benefits on several levels. It decreases the size of the datasets (fewer features, and fewer observations needed to build the model), and it reduces the processing and memory requirements to train and use the model. It also decreases the response time of the model when deployed in production.
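The two cases compared here can be pictured with a small sketch. It assumes that a chromosome is a binary mask with one gene per feature column, which is an assumption about the encoding rather than a detail stated in this section. The helper name `apply_mask` and the toy rows are hypothetical.

```python
def apply_mask(rows, mask):
    """Keep only the columns whose gene is 1 in the chromosome mask."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have one value per gene")
    return [[value for value, gene in zip(row, mask) if gene] for row in rows]


# Toy dataset with 4 feature columns (values are made up).
rows = [[5.1, 0, 3, 1], [4.9, 1, 2, 0]]

without_selection = apply_mask(rows, [1, 1, 1, 1])  # all features kept
with_selection = apply_mask(rows, [1, 0, 0, 1])     # hypothetical GA subset

print(without_selection)
print(with_selection)
```

Each resulting table would then be used to train and score the decision tree model, and the two scores compared.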


<img src="https://raw.githubusercontent.com/lucaslzl/ga_ia_p2/master/plots/res_classification.png" width="800" height="300">




In the table below, we show the number of features finally selected by the genetic algorithm that gives the best decision tree model for each classification problem.

Dataset name | Number of features | Number of features selected by the GA | Selected features as % of all features
--------- | --------- | --------- | ---------
Glass | 9 | 5 | 56%
Cellphone | 20 | 10 | 50%
Mushrooms | 22 | 10 | 45%
Airline Customer Satisfaction | 22 | 7 | 32%
Kobe | 24 | 9 | 38%
Flag | 29 | 16 | 55%
IBM | 34 | 17 | 50%
Bands | 37 | 23 | 62%
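
The last column of the table can be recomputed from the two counts. A minimal sketch, rounding to the nearest integer (the function name `selected_share` is hypothetical):

```python
# (total features, features selected by the GA), copied from the table above.
counts = {
    "Glass": (9, 5),
    "Cellphone": (20, 10),
    "Mushrooms": (22, 10),
    "Airline Customer Satisfaction": (22, 7),
    "Kobe": (24, 9),
    "Flag": (29, 16),
    "IBM": (34, 17),
    "Bands": (37, 23),
}


def selected_share(total, selected):
    """Share of selected features, as a percentage rounded to the nearest integer."""
    return round(100 * selected / total)


for name, (total, selected) in counts.items():
    print(f"{name}: {selected}/{total} = {selected_share(total, selected)}%")
```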

## All the other datasets

At the time of writing this report, the feature selection results of the genetic algorithm for the other datasets listed in the table in the methodology section weren't yet available for analysis.


## Discussions

**Time of execution of the GA per number of features**

For a relatively high number of features (50 and above), the execution time of the genetic algorithm grows very quickly. For this reason, it wasn’t possible to show the results for the Human Activity Classification dataset, which has more than 500 features. We should also add that the decision tree model is relatively fast to train. Combining the genetic algorithm with other artificial intelligence models, such as neural networks, would require a much longer simulation time.
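As a rough complement to the timing remark above, the sketch below compares the number of possible feature subsets with the number of fitness evaluations in one GA run. It assumes one model is trained per individual per generation (an assumption, not a measured figure) and uses the population of 50 and the limit of 100 iterations from the example command in the appendix. The function names are hypothetical.

```python
def subset_count(n_features):
    """Number of non-empty feature subsets for n_features features."""
    return 2 ** n_features - 1


def ga_evaluations(population, generations):
    """Upper bound on fitness evaluations, assuming one per individual per generation."""
    return population * generations


budget = ga_evaluations(population=50, generations=100)
for n in (9, 22, 37, 50):
    print(f"{n} features: {subset_count(n)} subsets, {budget} evaluations per run")
```

The evaluation budget stays fixed while the search space doubles with each added feature, and each evaluation typically has to train a model on more columns as well.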

**Importance of the choice of configurations**

As we saw during the analysis of the results, different configurations produce different results in terms of fitness values. Thus, it’s not possible to pick a single scenario that is the best one for all the datasets. There isn’t a “one size fits all” scenario. In addition, due to the randomness of the mutation and crossover processes, we recommend running the genetic algorithm a couple of times in order to get the best feature selection in terms of fitness value.
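The recommendation to run the algorithm several times can be sketched as follows. `run_ga_once` is a hypothetical stand-in with a toy fitness, used only so that the example runs; a real run would call the project's GA instead.

```python
import random


def run_ga_once(seed):
    """Stand-in for one GA run: returns (best_fitness, best_mask).
    Hypothetical placeholder with a toy fitness, for illustration only."""
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in range(10)]
    fitness = sum(mask) / len(mask)
    return fitness, mask


def best_of_runs(seeds):
    """Run the GA once per seed and keep the run with the highest fitness."""
    return max((run_ga_once(seed) for seed in seeds), key=lambda run: run[0])


best_fitness, best_mask = best_of_runs(range(5))
print(best_fitness, best_mask)
```

Fixing the seed of each run keeps the experiment reproducible while still sampling different random trajectories.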

**Limited results for some datasets**

Looking at the plots for all the datasets, the results are quite limited for some classification problems (the “Kobe” dataset, for instance). It’s important to point out that the limited results aren’t due to the genetic algorithm but are related to the choice of the decision tree model or to the choice of features in the dataset. Therefore, the strategy in this case might be to explore other features that might influence the target to classify. If that option has already been explored, then we can conclude that the decision tree classification model might not be the best machine learning technique for the problem at hand. In that case, one can save time by trying a more complex classification technique such as k-clustering, SVM, or neural networks.


# V - Conclusions

As we can see from this project, combining a machine learning method with  the genetic algorithm in order to find the best combinations of features, from all the features available in a dataset, can be an interesting strategy. 
The genetic algorithm is very efficient at selecting, combining and trying various subsets of features in order to improve the fitness value and build the best classification model.

For the vast majority of the datasets in our project, simply feeding all the features to the decision tree model produces limited results. The genetic algorithm is able to select the most appropriate features with the end goal of improving the fitness value of the machine learning model. In our case, the fitness value is the F1 score, but the algorithm can be easily adapted to cater to any other fitness function, or to another modeling problem such as regression. 
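For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a binary target in plain Python. It is an illustration only: the function name and sample labels are hypothetical, and multi-class problems would additionally need an averaging rule that this section does not specify.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(f1_score(y_true, y_pred))  # precision = recall = 0.75, so F1 = 0.75
```

Any function with the same shape, taking true and predicted targets and returning a number to maximize, could be swapped in as the fitness.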

Moreover, due to "the curse of dimensionality", the more features we include in the datasets, the more observations we need for model building and the larger the dataset becomes in memory. This has a significant impact on the execution time of the genetic algorithm. However, we should temper that fact by remembering that we need to execute the genetic algorithm only once for feature selection. On another note, we could improve the stop criterion of the genetic algorithm to make it stop earlier when, after a reasonable number of iterations, we see no improvement in the fitness value.
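Such a stop criterion could look like the sketch below. It is a proposal under stated assumptions, not part of the current code: the function name `should_stop`, the `patience` and `min_delta` parameters, and the fitness history are all hypothetical.

```python
def should_stop(best_history, patience=10, min_delta=1e-4):
    """Return True when the best fitness has not improved by more than
    min_delta over the last `patience` generations."""
    if len(best_history) <= patience:
        return False
    recent_best = max(best_history[-patience:])
    earlier_best = max(best_history[:-patience])
    return recent_best - earlier_best <= min_delta


# Made-up best-fitness curve: quick gains, then a plateau at 0.48.
curve = [0.30, 0.40, 0.45] + [0.48] * 20

history = []
stopped_at = None
for generation, best in enumerate(curve):
    history.append(best)
    if should_stop(history, patience=10):
        stopped_at = generation
        break

print(f"stopped at generation {stopped_at} of {len(curve)}")
```

The `patience` value trades wasted generations against the risk of stopping just before a late improvement, like the jump observed at generation 30 for C5 on the IBM dataset.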

In addition, the decision tree model is relatively easy and fast to build. If someone isn't sure which machine learning model to use, then the algorithm of this project might tell them whether the decision tree model is appropriate for the datasets and the problem at hand. If the fitness values found are limited, then either the features selected aren't adequate, and they might want to collect other features, or they might select a more complex machine learning model.





# VI - Sources

***Datasets used in the project:***

Airline Customer Satisfaction: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction

Kobe: https://www.kaggle.com/c/kobe-bryant-shot-selection/data

IBM: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Human Activity Classification: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones?select=train.csv

Flag: https://archive.ics.uci.edu/ml/datasets/Flags

Mushrooms: https://www.kaggle.com/uciml/mushroom-classification

Glass: https://www.kaggle.com/uciml/glass

Bands: https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands

Cellphones: https://www.kaggle.com/iabhishekofficial/mobile-price-classification


***Link to plots:***

Plots of the fitness values as a function of the configuration and generation: https://github.com/lucaslzl/ga_ia_p2/tree/master/plots

Plots of the best genome sequences as a function of the configuration and generation: https://github.com/lucaslzl/ga_ia_p2/tree/master/plots/bestgenomePlots

# Appendix

## **Link to the Github repository**

[https://github.com/lucaslzl/ga_ia_p2](https://github.com/lucaslzl/ga_ia_p2)

## **Link to the video**

https://www.youtube.com

## **How to execute the files**

There are two ways to execute the experiments:

**1. Execute the python code called "main.py" and pass the parameters**

The code takes parameters that control each run. It is possible to change the stop criterion, the crossover type, and more.

Run "python3 main.py --(strategy flag)" to execute.

For instance, "python3 main.py --population=50 --dataset=path --iteration_limit=100 --stop_criteria=1 --probs_type=0 --crossover_type=3 --crossover_rate=0.8 --mutation_type=0 --mutation_rate=0.03 --use_threads=0 --cut_half_pop=0 --replicate_best=0.1" executes a single feature selection run with the mentioned parameters.

It is worth pointing out that you may run "python3 main.py --help" to see all the possible flags and parameters.
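When several runs have to be launched with different parameters, the command line can be assembled programmatically. The sketch below only builds the argument list from the flags of the example command above. The helper name `build_command` and the dataset path are hypothetical, and `python3` is assumed to be the interpreter on the path.

```python
import shlex

# Default values copied from the example command above.
DEFAULTS = {
    "population": 50,
    "iteration_limit": 100,
    "stop_criteria": 1,
    "probs_type": 0,
    "crossover_type": 3,
    "crossover_rate": 0.8,
    "mutation_type": 0,
    "mutation_rate": 0.03,
    "use_threads": 0,
    "cut_half_pop": 0,
    "replicate_best": 0.1,
}


def build_command(dataset_path, **overrides):
    """Return the argument list for one run of main.py."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {"dataset": dataset_path, **DEFAULTS, **overrides}
    return ["python3", "main.py"] + [f"--{name}={value}" for name, value in params.items()]


# Hypothetical dataset path; adjust it to the local layout.
cmd = build_command("datasets/IBM.csv", crossover_rate=0.5)
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```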

**2. Execute the shell script called "execute.sh"**

The code executes every configuration for each dataset and saves the results.

Run "./execute.sh" and wait until it is done.

All results are saved in the results folder.

It is a good idea to use a virtual environment. The code was tested on Python 3.8.

## Individual Contributions

**Planning**
- **Genetic Algorithm**: Matheus Ferraroni
- **Methodology**: Entire Group
- **Experiments**: Oscar Ciceri, Maria Vitória R. Oliveira

**Code**
- **Genetic Algorithm**: Matheus Ferraroni Sanches, Oscar Ciceri
- **Machine Learning**: Lucas Zanco Ladeira, Aissa Hadj
- **Plots**: Maria Vitória R. Oliveira, Aissa Hadj

**Report**
- **I Introduction**: Lucas Zanco Ladeira
- **II Genetic Algorithm**: Matheus Ferraroni Sanches
- **III Methodology**: Lucas Zanco Ladeira, Aissa Hadj, Oscar Ciceri, Maria Vitória R. Oliveira
- **IV Simulation Results**: Maria Vitória R. Oliveira, Oscar Ciceri, Aissa Hadj, Lucas Zanco Ladeira 
- **V Conclusions**: Aissa Hadj
- **VI Sources**: Lucas Zanco Ladeira, Aissa Hadj