## Genetic Algorithm
_________
#### The Genetic Algorithm (GA) is an evolutionary algorithm (EA) inspired by Charles Darwin’s theory of natural selection, often summarized as "survival of the fittest". Under natural selection, the fittest individuals are selected to produce offspring, and the parents' characteristics are passed on to the offspring through cross-over and mutation, improving their chances of survival. Genetic algorithms are randomized search algorithms that generate high-quality solutions to optimization problems by imitating biologically inspired operators such as selection, cross-over, and mutation.

### Terminology for Genetic Algorithm
![](https://miro.medium.com/max/695/1*vIrsxg12DSltpdWoO561yA.png)
#### **Population** is the set of candidate solutions from which the stochastic search begins. The GA iterates over multiple generations until it finds an acceptable solution. The first generation is generated randomly.
#### **Chromosome** represents one candidate solution in the population. A chromosome is also referred to as a genotype. It is composed of genes, each of which holds the value of one variable being optimized.
#### **Phenotype** is the decoded parameter list for a genotype. A mapping is applied to the genotype to convert it to a phenotype.
#### The **Fitness function**, or objective function, evaluates each individual solution (phenotype) in every generation to identify the fittest members.
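For feature selection, as in this notebook, a chromosome can be a Boolean mask with one gene per feature, and the phenotype is the list of features that the mask keeps. The sketch below illustrates that mapping with plain Python lists. The `decode` helper and the feature names are assumptions made up for illustration; they are not part of the notebook's code.

```python
from itertools import compress

def decode(chromosome, feature_names):
    """Map a Boolean genotype to its phenotype: the names of the kept features."""
    if len(chromosome) != len(feature_names):
        raise ValueError("chromosome needs exactly one gene per feature")
    return list(compress(feature_names, chromosome))

# Hypothetical feature names, for illustration only.
feature_names = ["radius", "texture", "perimeter", "area"]
chromosome = [True, False, True, True]   # genotype: one gene per feature

print(decode(chromosome, feature_names))  # ['radius', 'perimeter', 'area']
```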
__________
### Different Genetic Operators
#### **Selection** is the process of choosing the fittest solutions from a population to act as parents of the next generation, so that the next generation inherits their strong features. Selection can be performed using **Roulette Wheel Selection** or **Ranked Selection**, both based on the fitness value.
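The two selection schemes named above can be sketched as follows. This is a minimal illustration, not the notebook's method: the `selection()` function defined later in this notebook simply keeps the top `n_parents` chromosomes. The function names, the sample population, and the fitness values here are assumptions chosen for the example.

```python
import random

def roulette_wheel_selection(population, fitness, n_parents, rng=random):
    """Pick parents with probability proportional to fitness (with replacement)."""
    if min(fitness) < 0 or sum(fitness) == 0:
        raise ValueError("fitness values must be non-negative and not all zero")
    return rng.choices(population, weights=fitness, k=n_parents)

def rank_selection(population, fitness, n_parents, rng=random):
    """Pick parents with probability proportional to rank (worst = 1, best = N)."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return rng.choices(population, weights=ranks, k=n_parents)

rng = random.Random(0)               # seeded so the run is repeatable
population = ["A", "B", "C", "D"]    # stand-ins for chromosomes
fitness = [0.91, 0.95, 0.80, 0.93]   # made-up accuracies
parents = roulette_wheel_selection(population, fitness, n_parents=4, rng=rng)
print(parents)
```

Rank selection depends only on the ordering of the fitness values, so it keeps some selection pressure even when all accuracies are close together, as they are in this example.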

#### **Cross-over**, or recombination, exchanges genes between two fit parents to form a new genotype (solution). It can be one-point or multi-point, depending on how many segments of the parents' genes are exchanged.
![image.png](attachment:e240e0f3-60da-44b4-81f7-16bb1e506ff5.png)
#### Here **One-point Cross-over** is used.
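A minimal sketch of one-point cross-over with a random cut point, using plain Python lists. It is an illustration under assumptions: the `crossover()` function defined later in this notebook cuts at the fixed midpoint and produces one child per pair, whereas this hypothetical helper picks the cut at random and returns both children.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents at a random cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length (at least 2 genes)")
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the chromosome
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

parent_a = [True] * 6
parent_b = [False] * 6
child_1, child_2 = one_point_crossover(parent_a, parent_b, random.Random(1))
print(child_1)
print(child_2)
```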
#### After a new population is created through selection and cross-over, it is randomly modified through **mutation**. Mutation randomly alters genes in a genotype to promote diversity in the population, which helps the search find better solutions.
![](https://miro.medium.com/max/385/1*bk6zF_rpgGi8IcPIY6fCWg.png)
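Bit-flip mutation on a Boolean chromosome can be sketched as below. This is one common variant, shown under assumptions: each gene flips independently with probability `mutation_rate`, whereas the `mutation()` function defined later in this notebook flips a fixed number of randomly chosen positions. The helper name is hypothetical.

```python
import random

def bit_flip_mutation(chromosome, mutation_rate, rng=random):
    """Return a mutated copy; each gene flips independently with probability mutation_rate."""
    if not 0.0 <= mutation_rate <= 1.0:
        raise ValueError("mutation_rate must be between 0 and 1")
    return [(not gene) if rng.random() < mutation_rate else gene
            for gene in chromosome]

original = [True, True, False, True, False, False, True, True]
mutated = bit_flip_mutation(original, mutation_rate=0.2, rng=random.Random(3))
print(mutated)
```

Returning a copy rather than modifying the chromosome in place keeps the parent intact, which matters when the same chromosome is also stored as the best solution of a generation.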
______
### Usage of Genetic Algorithm in Artificial Intelligence
#### A Genetic Algorithm is used for search and optimization: it iterates to arrive at the best solution among many candidates.
#### 1. A Genetic Algorithm can find an appropriate set of hyperparameters and their values for a deep learning model to improve its performance.
#### 2. A Genetic Algorithm can also be used to determine the best subset of features to include in a machine learning model for predicting the target variable.
____

### Working of Genetic Algorithm
![](https://miro.medium.com/max/598/1*TZ840m0DvghL80GodVGLeQ.png)
____

### Importing the required libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from random import randint
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
def split(df,label):
    X_tr, X_te, Y_tr, Y_te = train_test_split(df, label, test_size=0.25, random_state=42)
    return X_tr, X_te, Y_tr, Y_te

from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

classifiers = ['LinearSVM', 'RadialSVM', 
               'Logistic',  'RandomForest', 
               'AdaBoost',  'DecisionTree', 
               'KNeighbors','GradientBoosting']

models = [svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          LogisticRegression(max_iter = 1000),
          RandomForestClassifier(n_estimators=200, random_state=0),
          AdaBoostClassifier(random_state = 0),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          GradientBoostingClassifier(random_state=0)]


def acc_score(df,label):
    # Fit every classifier on the same train/test split and rank them by test accuracy.
    Score = pd.DataFrame({"Classifier":classifiers})
    acc = []
    X_train,X_test,Y_train,Y_test = split(df,label)
    for model in models:
        model.fit(X_train,Y_train)
        predictions = model.predict(X_test)
        acc.append(accuracy_score(Y_test,predictions))
    Score["Accuracy"] = acc
    Score.sort_values(by="Accuracy", ascending=False,inplace = True)
    Score.reset_index(drop=True, inplace=True)
    return Score

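def plot(score,x,y,c = "b"):
    # One x-axis tick per generation, so the plot also works when n_gen is not 5.
    gen = list(range(1,len(score)+1))
    plt.figure(figsize=(6,4))
    ax = sns.pointplot(x=gen, y=score,color = c )
    ax.set(xlabel="Generation", ylabel="Accuracy")
    ax.set(ylim=(x,y))   # x and y are the lower and upper limits of the accuracy axis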

In [None]:
def initilization_of_population(size,n_feat):
    population = []
    for i in range(size):
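        chromosome = np.ones(n_feat,dtype=bool)        # np.bool was removed in NumPy 1.24; the builtin bool is equivalent
        chromosome[:int(0.3*n_feat)]=False             # switch off roughly 30% of the features before shuffling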
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population


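def fitness_score(population):
    # Relies on the globals logmodel, X_train, X_test, Y_train and Y_test being set before generations() is called.
    scores = []
    for chromosome in population:
        logmodel.fit(X_train.iloc[:,chromosome],Y_train)          # train on the selected features only
        predictions = logmodel.predict(X_test.iloc[:,chromosome])
        scores.append(accuracy_score(Y_test,predictions))
    scores, population = np.array(scores), np.array(population) 
    inds = np.argsort(scores)                                     # indices in ascending order of accuracy
    return list(scores[inds][::-1]), list(population[inds,:][::-1])  # reversed, so the best chromosome comes first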


def selection(pop_after_fit,n_parents):
    population_nextgen = []
    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])
    return population_nextgen


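def crossover(pop_after_sel):
    pop_nextgen = list(pop_after_sel)                # copy the list so the caller's list is not modified
    for i in range(0,len(pop_after_sel)-1,2):        # pair parents (0,1), (2,3), ...; an unpaired last parent is skipped
        parent_1 , parent_2 = pop_nextgen[i] , pop_nextgen[i+1]
        half = len(parent_1)//2
        # One-point crossover at the midpoint: first half of parent_1, second half of parent_2.
        child = np.concatenate((parent_1[:half],parent_2[half:]))
        pop_nextgen.append(child)
    return pop_nextgen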


def mutation(pop_after_cross,mutation_rate,n_feat):   
    mutation_range = int(mutation_rate*n_feat)       # number of bit flips per chromosome
    pop_next_gen = []
    for n in range(0,len(pop_after_cross)):
        chromo = pop_after_cross[n].copy()           # copy, so the flips do not alter chromosomes referenced elsewhere
        rand_posi = [] 
        for i in range(0,mutation_range):
            pos = randint(0,n_feat-1)                # positions may repeat; a repeated position flips back
            rand_posi.append(pos)
        for j in rand_posi:
            chromo[j] = not chromo[j]  
        pop_next_gen.append(chromo)
    return pop_next_gen

def generations(df,label,size,n_feat,n_parents,mutation_rate,n_gen,X_train,
                                   X_test, Y_train, Y_test):
    best_chromo= []
    best_score= []
    population_nextgen=initilization_of_population(size,n_feat)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print('Best score in generation',i+1,':',scores[:1])
        # Record the best chromosome before mutation, as a copy, so later in-place changes cannot alter it.
        best_chromo.append(pop_after_fit[0].copy())
        best_score.append(scores[0])
        pop_after_sel = selection(pop_after_fit,n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross,mutation_rate,n_feat)
    return best_chromo,best_score

____
### Function Description
#### 1. split():
Splits the dataset into training and test set.
#### 2. acc_score():
Returns accuracy for all the classifiers.
#### 3. plot():
For plotting the results.
_____
### Function Description for Genetic Algorithm
#### 1. initilization_of_population():
Initializes a random population in which roughly 30% of the features are switched off in each chromosome.
#### 2. fitness_score():
Returns all chromosomes sorted by accuracy (best first), along with their scores.
#### 3. selection():
Keeps the best n_parents chromosomes as parents.
#### 4. crossover():
Builds one child per pair of parents from the first half of the first parent and the second half of the second parent.
#### 5. mutation():
Randomly flips bits in each chromosome of the population after crossover.
#### 6. generations():
Executes all the above functions for the specified number of generations.
____
### The following 3 datasets are used:

1. Breast Cancer
2. Parkinson's Disease
3. PCOS
_____
### Plan of action:

* Looking at dataset (includes a little preprocessing)
* Checking Accuracy (comparing accuracies before and after feature selection)
* Visualization (Plotting the graphs)
____

## Implementation of Genetic Algorithm for Feature Selection
________
#### First, we run a function to initialize a random population.
#### The randomized population is then run through the fitness function, which returns the chromosomes ranked by accuracy, best first.
#### The top chromosomes are selected as parents; the n_parents parameter sets how many.
#### The selected parents are then passed through the crossover and mutation functions, in that order.
#### Cross-over combines genes from two parents by taking the first half of the first parent and the second half of the second parent.
#### Mutation is achieved by randomly flipping selected bits of each chromosome.
#### A new generation is thus created by selecting the fittest parents from the previous generation and applying cross-over and mutation.
#### This process is repeated for n generations.
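The loop described above can be sketched end to end on a toy problem that needs no dataset or classifier. Everything here is an assumption made for illustration: the fitness is the fraction of genes matching a hidden target mask (standing in for model accuracy), parents are carried over unmutated, and only the children are mutated, which differs from the notebook's `generations()` function.

```python
import random

def run_toy_ga(target, pop_size=20, n_parents=10, mutation_rate=0.05,
               n_gen=30, seed=0):
    """Evolve Boolean masks toward `target`; return the best fitness per generation."""
    rng = random.Random(seed)
    n = len(target)

    def fitness(chromosome):
        # Toy fitness: fraction of genes that match the hidden target mask.
        return sum(g == t for g, t in zip(chromosome, target)) / n

    # Random first generation.
    population = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(n_gen):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_gen.append(fitness(ranked[0]))
        parents = ranked[:n_parents]              # selection: keep the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)         # two distinct parents
            cut = rng.randint(1, n - 1)           # one-point cross-over
            child = a[:cut] + b[cut:]
            # Bit-flip mutation applied to the child only.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return best_per_gen

target = [True, False, True, True, False, False, True, False, True, True]
history = run_toy_ga(target)
print(history[0], history[-1])
```

Because the parents survive unchanged into the next generation, the best fitness in this sketch can never decrease from one generation to the next.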
______

____
# Breast Cancer
____

### 1. Looking at dataset

In [None]:
data_bc = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
label_bc = data_bc["diagnosis"]
label_bc = np.where(label_bc == 'M',1,0)
data_bc.drop(["id","diagnosis","Unnamed: 32"],axis = 1,inplace = True)

print("Breast Cancer dataset:\n",data_bc.shape[0],"Records\n",data_bc.shape[1],"Features")

In [None]:
display(data_bc.head())
print("All the features in this dataset have continuous values")

### 2. Checking Accuracy

In [None]:
score1 = acc_score(data_bc,label_bc)
score1

#### Choosing the best classifier for further calculations

In [None]:
logmodel = RandomForestClassifier(n_estimators=200, random_state=0)
X_train,X_test, Y_train, Y_test = split(data_bc,label_bc)
chromo_df_bc,score_bc=generations(data_bc,label_bc,size=80,n_feat=data_bc.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5,
                         X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)

#### Compared with the baseline accuracy above, we can see an improvement of 1-2%.

### 3. Visualization

In [None]:
plot(score_bc,0.9,1.0,c = "gold")

_____
# Parkinson's disease
_____

### 1. Looking at dataset

In [None]:
data_pd = pd.read_csv("../input/parkinson-disease-detection/Parkinsson disease.csv")
label_pd = data_pd["status"]
data_pd.drop(["status","name"],axis = 1,inplace = True)

print("Parkinson's disease dataset:\n",data_pd.shape[0],"Records\n",data_pd.shape[1],"Features")

In [None]:
display(data_pd.head())
print("All the features in this dataset have continuous values")

### 2. Checking Accuracy

In [None]:
score3 = acc_score(data_pd,label_pd)
score3

In [None]:
logmodel = DecisionTreeClassifier(random_state=0)
X_train,X_test, Y_train, Y_test = split(data_pd,label_pd)
chromo_df_pd,score_pd=generations(data_pd,label_pd,size=80,n_feat=data_pd.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5,
                         X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)

#### Compared with the baseline accuracy above, we can see an improvement of 5-7%.

### 3. Visualization

In [None]:
plot(score_pd,0.9,1.0,c = "orange")

____
# PCOS
____

### 1. Looking at dataset

In [None]:
data_pcos = pd.read_csv("../input/pcos-dataset/PCOS_data.csv")
label_pcos = data_pcos["PCOS (Y/N)"]
data_pcos.drop(["Sl. No","Patient File No.","PCOS (Y/N)","Unnamed: 44","II    beta-HCG(mIU/mL)","AMH(ng/mL)"],axis = 1,inplace = True)
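data_pcos["Marraige Status (Yrs)"] = data_pcos["Marraige Status (Yrs)"].fillna(data_pcos["Marraige Status (Yrs)"].median())   # fill missing values with the median
data_pcos["Fast food (Y/N)"] = data_pcos["Fast food (Y/N)"].fillna(1)   # assignment instead of inplace=True, which is unreliable on a single column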

print("PCOS dataset:\n",data_pcos.shape[0],"Records\n",data_pcos.shape[1],"Features")

In [None]:
display(data_pcos.head())
print("The features in this dataset have both discrete and continuous values")

### 2. Checking Accuracy

In [None]:
score4 = acc_score(data_pcos,label_pcos)
score4

In [None]:
logmodel = RandomForestClassifier(n_estimators=200, random_state=0)
X_train,X_test, Y_train, Y_test = split(data_pcos,label_pcos)
chromo_df_pcos,score_pcos=generations(data_pcos,label_pcos,size=80,n_feat=data_pcos.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5,
                         X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)

#### Compared with the baseline accuracy above, we can see an improvement of 3-4%.

### 3. Visualization

In [None]:
plot(score_pcos,0.9,1.0,c = "limegreen")

_______
## Note:
#### The "chromo_df" variables (chromo_df_bc, chromo_df_pd, chromo_df_pcos) each hold a list of NumPy Boolean arrays, one per generation, showing which features the Genetic Algorithm selected (False marks a dropped feature).
[array([ True,  True,  True, False,  True,  True, False, False, False,                  
        False,  True,  True,  True, False,  True, False, False,  True,                  
        False,  True, False,  True,  True,  True,  True,  True,  True,                  
        False, False, False]),               
        .                
        .                   
        .                  
        .          
        ]   
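To turn such a list of masks into a readable result, one option is to pick the generation with the highest score and list the features its mask keeps. The helper below is a hypothetical sketch using plain Python lists and made-up values. In the notebook, the inputs would be the two lists returned by `generations()` plus the column names of the corresponding DataFrame, which is an assumption about how one might wire it up.

```python
def best_feature_subset(chromosomes, scores, feature_names):
    """Return (generation number, score, kept feature names) for the best generation."""
    if not (len(chromosomes) == len(scores) and scores):
        raise ValueError("need one score per chromosome and at least one generation")
    best_gen = max(range(len(scores)), key=scores.__getitem__)  # first best on ties
    mask = chromosomes[best_gen]
    selected = [name for name, keep in zip(feature_names, mask) if keep]
    return best_gen + 1, scores[best_gen], selected

# Made-up masks and scores for three generations, for illustration only.
chromosomes = [[True, False, True], [False, True, True], [True, True, False]]
scores = [0.90, 0.95, 0.95]
feature_names = ["f0", "f1", "f2"]

print(best_feature_subset(chromosomes, scores, feature_names))
# (2, 0.95, ['f1', 'f2'])
```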
________

#### From these results, we can see a greater improvement in accuracy than with feature selection methods such as Variance Threshold, Pearson Correlation, and F-score.
#### Links to these methods:
##### [Variance Threshold](https://www.kaggle.com/tanmayunhale/feature-selection-variance-threshold)
##### [Pearson Correlation](https://www.kaggle.com/tanmayunhale/feature-selection-pearson-correlation)
##### [F-score](https://www.kaggle.com/tanmayunhale/feature-selection-f-score)
#### Reference article: [Genetic Algorithm Optimization Algorithm](https://pub.towardsai.net/genetic-algorithm-optimization-algorithm-f22234015113)
