# Genetic Programming

### Idea of the algorithms
#### Template 1: Using text embedding with the BERT Transformer:

1. Select text columns from the dataframe, encode them with the BERT model;
2. Imputing NaN values, if any;
3. Dimensionality reduction with PCA;
4. Using TPOT to find the most appropriate classification algorithm.

#### Template 2: Using tokenization and text embedding with the Spacy model:

1. Select text columns from the dataframe and tokenize them;
2. Removing punctuation signs and stop words from tokens;
3. Using lemmatization to simplify a sentence;
4. Vectorization of simplified sentences using one of the Spacy models;
5. Imputing NaN values, if any;
6. Dimensionality reduction with PCA;
7. Using TPOT to find the most appropriate classification algorithm.

---

Unfortunately, I could not put both algorithms in TPOT due to the high resource limitation. Therefore, I ran both algorithms separately and then compared them on balanced accuracy on a test dataset to determine the best pipeline. In addition, for the reason of the resource limitation, I could not set a large number of generations and individuals within TPOT, so the accuracy value is not very high.

### Step 1: Importing packages and loading data 

In [None]:
# # Some extra packages should be installed

# !pip install sentence_transformers
# !pip install tpot
# !pip install spacy

In [None]:
import numpy as np
import pandas as pd
import copy
import spacy
import warnings
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from tpot import TPOTClassifier
from tpot.config import classifier_config_dict_light
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer

# nltk.download('wordnet')
# nltk.download('punkt')

warnings.filterwarnings("ignore")

In [4]:
# Importing data
df = pd.read_csv('GP data.csv')
df

Unnamed: 0.1,Unnamed: 0,id,qid1,qid2,question1,question2,bot?,number_likes,is_duplicate
0,0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,False,30634,0
1,1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,False,64192,0
2,2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,False,1396,0
3,3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,False,24049,0
4,4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,False,13076,0
...,...,...,...,...,...,...,...,...,...
1995,1995,1995,3969,3970,"I am visiting Sri Lanka soonfor 9 days, how ca...",Do Indians hate Sri Lankans?,False,24454,0
1996,1996,1996,3971,3972,What are some good examples of 4 stanza poems?,What are some good Ilocano poems?,False,2611,0
1997,1997,1997,3973,3974,Which CPU is better I3 4th Gen or 6th Gen?,Which is better intel i5 (6th gen) or i7 (5th ...,True,18483,0
1998,1998,1998,3975,3976,What are some of the best tourist places to vi...,Where are the foremost tourist places in Chhat...,True,28018,1


In [5]:
# Dropping ID columns
new_df = df.drop(['id','qid1','qid2', 'Unnamed: 0'], axis=1)
new_df

Unnamed: 0,question1,question2,bot?,number_likes,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,False,30634,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,False,64192,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,False,1396,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,False,24049,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,False,13076,0
...,...,...,...,...,...
1995,"I am visiting Sri Lanka soonfor 9 days, how ca...",Do Indians hate Sri Lankans?,False,24454,0
1996,What are some good examples of 4 stanza poems?,What are some good Ilocano poems?,False,2611,0
1997,Which CPU is better I3 4th Gen or 6th Gen?,Which is better intel i5 (6th gen) or i7 (5th ...,True,18483,0
1998,What are some of the best tourist places to vi...,Where are the foremost tourist places in Chhat...,True,28018,1


In [6]:
# Separating data into targets and features
y = new_df['is_duplicate']
X = new_df.drop(['is_duplicate'], axis=1)

### Step 2: BERT transformer 

In [None]:
class BertTransformer(TransformerMixin, BaseEstimator):
    """
    Transformation algorithm
    
    Transformation is performed by the BERT model
    """
    def __init__(self):
        """
        Initializer of the class
        
        Contains a preloaded BERT model
        """
        # Preloaded BERT model
        self.sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
    
    def fit(self, X, y=None):
        """
        Model is already fitted

        :param X: pandas dataframe of features
        :param y: pandas dataframe of targets
        :return: current instance of the transformer class
        """
        return self
    
    def transform(self, X):
        """
        Embedding of text columns by BERT

        :param X: pandas dataframe to be transformed
        :return: transformed pandas dataframe
        """
        X_copy = X.copy()
        new_data = pd.DataFrame()
        for i, column in enumerate(X_copy.columns):
            if type(X_copy[column].iloc[0])==str:
                # Extracting text features from the dataframe, 
                # encoding them by BERT encoder
                new_column = self.sbert_model.encode(X_copy[column].to_list())
                new_column = pd.DataFrame(new_column)

                # Store encoded features at new dataframe
                for j, col in enumerate(new_column):
                    new_data['Feature '+str(i)+"."+str(j)] = new_column[col]
            else:
                # Non-text features should be stored at new dataframe whithout any changes
                new_column = X_copy[column]
                new_data['Feature '+str(i)] = new_column
        return new_data

# Splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8, test_size=0.2, 
                                                    random_state=123, stratify=np.array(y))

# Configuring TPOT classifier with custom BERT feature
config_bert = copy.deepcopy(classifier_config_dict_light)
config_bert["__main__.BertTransformer"] = {}
config_bert['sklearn.impute.SimpleImputer'] = {}
config_bert['sklearn.preprocessing.StandardScaler'] = {}

# Create and fit TPOT classifier with BERT custom transformer
tpot_bert = TPOTClassifier(config_dict=config_bert, verbosity=2, generations=5, population_size=16,
                      template='BertTransformer-SimpleImputer-StandardScaler-PCA-Selector-Transformer-Classifier')
tpot_bert.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/96 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.6962499999999999

Generation 2 - Current best internal CV score: 0.6962499999999999

Generation 3 - Current best internal CV score: 0.6962499999999999

Generation 4 - Current best internal CV score: 0.6987500000000001

Generation 5 - Current best internal CV score: 0.6987500000000001

Best pipeline: GaussianNB(Normalizer(VarianceThreshold(PCA(StandardScaler(SimpleImputer(BertTransformer(input_matrix))), iterated_power=3, svd_solver=randomized), threshold=0.1), norm=l1))


In [None]:
# Calculating balanced accuracy of BERT pipeline on test set
y_pred_bert = tpot_bert.predict(X_test)
acc_bert = balanced_accuracy_score(y_test, y_pred_bert)
acc_bert

0.6573359073359073

### Step 3: Lemmatizing transformer 

In [11]:
class LemmatizingTransformer(TransformerMixin, BaseEstimator):
    """
    Transformation algorithm
    
    Transformation is performed by tokenization, lematization, and vectorizing by Spacy model
    """
    def __init__(self):
        """
        Initializer of the class

        Contains a preloaded Spacy model and Lemmatizer
        """
        self.nlp = spacy.load("en_core_web_sm")
        self.lemmatizer = WordNetLemmatizer()
    
    def fit(self, X, y=None):
        """
        Model is already fitted

        :param X: pandas dataframe of features
        :param y: pandas dataframe of targets
        :return: current instance of the transformer class
        """
        return self
    
    def transform(self, X):
        """
        Embedding of text columns by Spacy

        :param X: pandas dataframe to be transformed
        :return: transformed pandas dataframe
        """
        X_copy = X.copy()
        output_df = pd.DataFrame()
        for i, column in enumerate(X_copy.columns):
            if type(X_copy[column].iloc[0])==str:
                # Extracting text features from the dataframe
                vectorized = []
                for sentence in X_copy[column]:
                    # Tokenization of the sentence
                    tokens = sentence.lower()
                    tokens = self.nlp(tokens)
                    tokenized = ""
                    for token in tokens:
                        if not (token.is_punct or token.is_stop):
                            # Removing stop words and punctuation signs from the list of tokens
                            tokenized+=str(token.lemma_)
                            tokenized+=" "

                    # Forming a new sentence from non-removed tokens
                    tokenized = " ".join(word_tokenize(tokenized))

                    # Lematizing words from the sentence
                    lemmatized = []
                    for word in tokenized.split():
                        lemmatized.append(str(self.lemmatizer.lemmatize(word, pos="v")))
                    string = " ".join(lemmatized)

                    # Removing repeating tokens
                    words = []
                    for j in lemmatized:
                        if (string.count(j)>=1 and (j not in words)):
                            words.append(j)
                    final_sentence = ' '.join(words)

                    # Vectorization of the final version of sentence
                    final_sentence = self.nlp(final_sentence)
                    vectorized.append(list(final_sentence.vector))
                    
                # Append new features to dataframe
                pd_vectorized = pd.DataFrame(vectorized)
                pd_vectorized = pd_vectorized.rename(columns={
                    x: "Feature."+str(i)+"."+str(x) for x in range(len(pd_vectorized.columns))
                })
                output_df = pd.concat([output_df, pd_vectorized], axis=1)
            else:
                # For non-text features
                output_df["Feature "+str(i)] = pd.DataFrame(X_copy[column])
        return output_df

# Configuring TPOT classifier with custom lemmatizing features
config_lem = copy.deepcopy(classifier_config_dict_light)
config_lem["__main__.LemmatizingTransformer"] = {}
config_lem['sklearn.impute.SimpleImputer'] = {}
config_lem['sklearn.preprocessing.StandardScaler'] = {}

# Create and fit TPOT classifier with a custom lemmatizing transformer
tpot_lem = TPOTClassifier(config_dict=config_lem, verbosity=2, generations=1, population_size=16,
                      template='LemmatizingTransformer-SimpleImputer-StandardScaler-PCA-Selector-Transformer-Classifier')
tpot_lem.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/32 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.6368750000000001

Best pipeline: KNeighborsClassifier(RobustScaler(SelectPercentile(PCA(StandardScaler(SimpleImputer(LemmatizingTransformer(input_matrix))), iterated_power=8, svd_solver=randomized), percentile=15)), n_neighbors=54, p=1, weights=uniform)


In [14]:
# Calculating balanced accuracy of lemmatizing pipeline on test set
y_pred_lem = tpot_lem.predict(X_test)
acc_lem = balanced_accuracy_score(y_test, y_pred_lem)
acc_lem

0.5342127842127842

### Step 4: Pipelines comparison

In [None]:
# Choose best pipeline of BERT and Lemmatizing ones
if acc_lem > acc_bert:
    best_pipe = tpot_lem
    best_acc = acc_lem
else:
    best_pipe = tpot_bert
    best_acc = acc_bert

best_pipe.export("best_pipeline.py")
print("Best pipeline:", best_pipe)
print("Best balanced accuracy:", best_acc)

Best pipeline: TPOTClassifier(config_dict={'__main__.BertTransformer': {},
                            'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
                                                                                  'l1',
                                                                                  'l2',
                                                                                  'manhattan',
                                                                                  'cosine'],
                                                                     'linkage': ['ward',
                                                                                 'complete',
                                                                                 'average']},
                            'sklearn.decomposition.PCA': {'iterated_power': range(1, 11),
                                                          'svd_solver': ['randomized']},
           