In [19]:
import pandas as pd
import random
from statistics import fmean, stdev

from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import DistanceMetric
from sklearn.ensemble import RandomForestClassifier

OBJECTS/VARIABLES:

Antibody - list of numerical values (reals?), represented by a row of a dataframe?
Population - List[Antibody], represented by the whole dataframe?

Target Size (i.e. the number of antibodies to be created) = (Size of majority class) - (Size of minority class)
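
A minimal sketch of the target-size rule above, assuming a binary label column. `target_size` and the toy labels are hypothetical names used only for illustration; they are not part of the notebook.

```python
from collections import Counter


def target_size(labels):
    """Number of antibodies needed to balance a binary-labelled dataset."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # Majority count minus minority count
    return max(counts.values()) - min(counts.values())


toy_labels = [0] * 7 + [1] * 3  # 7 majority rows, 3 minority rows
print(target_size(toy_labels))  # 4
```

With a pandas label column, the same counts could come from `value_counts()` instead of `Counter`.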



FUNCTIONS:

Initializing :- input: original dataframe

                         do: Get bounds of minority class by taking the highest and lowest values in each of the [n] dimensions

                         How: Do we take the whole minority class, or do we sample part or parts of it to generate our bounds?

                         output: Upper and lower bounds of the minority class


Creation :- Input: Bounds of the min class

        do: Create a set of antibodies

        input: minority Dataframe 

        How:
            Possibilities:
            (As the Mahalanobis paper does it) Take a random value between the bounds of the minority class feature as the datapoint

            (Nikhil) Just sample the minority class based on the imbalance rate (doesn't require bounds)

            (Adam) Take a random value, as in the paper's method, but drawn from a weighted curve, i.e. still randomize,
                but add some preference for values close to the boundary, or close to the center, etc.
                        -We could set this as a parameter: bell curve, linear. This same function could be a parameter in the mutation stage.
                        -If the density within the bounds is concentrated on one side, add bias towards that side in the random value.
        
        Challenges: 
            How do we deal with the different data categories (e.g. nominal, ordinal, and continuous)? Continuous is easy: just a number in a range. Ordinal is ???. Nominal is difficult: even if one-hot encoded, random generation might set two values that should be mutually exclusive (e.g. an item being both blue and red). How do other imputation algorithms work with these problems? Do they even work with these problems?
        output: Initial Population as a DF?
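
One possible way to handle the one-hot problem raised under Challenges, sketched under assumptions: treat each one-hot group as a single categorical draw, weighted by how often each category appears in the minority class, so exactly one column in the group is set. `sample_one_hot` and the colour data are hypothetical; this is not a method taken from the paper.

```python
import random


def sample_one_hot(rows):
    """Draw one one-hot vector for a nominal feature.

    rows: one-hot encoded values of that feature for the minority class,
          e.g. [[1, 0, 0], [0, 1, 0], ...].
    """
    width = len(rows[0])
    # How often each category occurs among the minority rows
    counts = [sum(row[i] for row in rows) for i in range(width)]
    # Pick a single category index, so the result is always mutually exclusive
    choice = random.choices(range(width), weights=counts, k=1)[0]
    return [1 if i == choice else 0 for i in range(width)]


# Hypothetical colour feature encoded as (blue, red, green)
minority_colour = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
random.seed(0)
print(sample_one_hot(minority_colour))
```

Categories that never occur in the minority class get weight zero and are never generated. Whether that is desirable depends on how much exploration the antibodies should do.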



**Initialization**

In [2]:

df = pd.read_csv("./Data/TitanicData_syn.csv")

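# Feature column names: every column except the last one, which holds the label
columns = df.columns.to_list()
columns_drop = columns.pop(-1)

# Drop rows containing NaN; an imputer could be used here instead
df.dropna(inplace=True)

# Labels: what remains after dropping every feature column
labels = df.drop(columns, axis=1)

# Features: everything except the "Result" label column
df = df.drop("Result", axis=1)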
df


Unnamed: 0,class2,AGE,Sex
0,0.0214,-0.228,-1.920
1,0.9650,-0.228,0.521
2,0.0214,-0.228,0.521
3,0.9650,-0.228,0.521
4,0.9650,-0.228,0.521
...,...,...,...
1996,-0.9230,-0.228,-1.920
1997,0.0214,-0.228,0.521
1998,0.0214,4.380,-1.920
1999,-0.9230,-0.228,0.521


In [3]:

count_nan = df.isnull().sum()
count_nan


class2    0
AGE       0
Sex       0
dtype: int64

In [25]:


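def get_bounds(minorityDF) -> list:
    # Returns one (column name, min, max) tuple per column of the minority-class dataframe
    out = []
    for col in minorityDF:
        # Read from the dataframe that was passed in, not the global df
        colMax = minorityDF[col].max()
        colMin = minorityDF[col].min()
        out += [(col, colMin, colMax)]
    return out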

# This only works for continuous values. A separate version is needed for binary fields (we assume any categorical columns have already been encoded).
####### Creation ################
# minorityDF - dataframe containing the minority class
# totalPopulation - the total number of antibodies to create
# weightingFunction - one of "uniform", "triangular", or "gauss"
# mode - used only with "triangular": the fraction of each column's range (between 0.0 and 1.0) at which the distribution peaks
def Creation(minorityDF, totalPopulation : int, weightingFunction : str = "uniform", mode : float = 0.5):

    if minorityDF.isnull().values.any():
        raise ValueError("Minority Class DataFrame contains NaN")

    if mode < 0.0 or mode > 1.0:
        raise ValueError("mode must be a float between 0.0 and 1.0")

    # Fail loudly on an unknown name instead of silently returning an empty population
    if weightingFunction not in ["uniform", "triangular", "gauss"]:
        raise ValueError("weightingFunction must be 'uniform', 'triangular', or 'gauss'")

    population = [] # Initializing the empty population
    bounds = get_bounds(minorityDF) # Computed once; the bounds do not change between antibodies

    for i in range(totalPopulation): # For every antibody to be created

        antibody = [] # Initializing a single antibody
        if weightingFunction in ["uniform", "triangular"]: # Uniform and triangular sampling use the column bounds

            for col in bounds: # Iterate through the columns/dimensions/features of the minority class
                if weightingFunction == "uniform":
                    antibody += [round(random.uniform(col[1], col[2]), 4)] # Random value between the lower and upper bounds

                else:
                    # Peak of the triangle: the low bound plus `mode` times the range
                    tri_tip = ((col[2] - col[1]) * mode) + col[1]
                    # Clamp to the bounds to guard against floating-point drift
                    tri_tip = min(max(tri_tip, col[1]), col[2])

                    antibody += [round(random.triangular(col[1], col[2], tri_tip), 5)]

        else: # Gaussian sampling uses each column's mean and standard deviation

            for col in minorityDF:
                values = minorityDF[col].tolist()

                antibody += [round(random.gauss(fmean(values), stdev(values)), 5)]

        population += [antibody] # Add the created antibody to the population

    return population
    

print(Creation(df,3, weightingFunction='gauss'))

[[-0.18553, 0.6865, -1.32287], [1.36359, 0.43997, 0.32751], [-0.81489, 1.12035, -0.34511]]
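
The Creation cell notes that it only handles continuous values. A rough sketch of one way binary fields might be handled, assuming a column counts as binary when it has at most two distinct values: draw one of the observed values (which preserves their observed frequency) instead of a number between the bounds. `create_mixed_antibody`, `max_levels`, and the toy columns are hypothetical and not part of the notebook.

```python
import random


def create_mixed_antibody(columns, max_levels=2):
    """Build one antibody from a list of minority-class columns.

    columns: list of lists, one list of observed values per feature.
    """
    antibody = []
    for values in columns:
        levels = sorted(set(values))
        if len(levels) <= max_levels:
            # Binary (or constant) field: pick an observed value, so rarer
            # values are drawn proportionally less often
            antibody.append(random.choice(values))
        else:
            # Continuous field: uniform draw between the observed bounds
            antibody.append(round(random.uniform(levels[0], levels[-1]), 4))
    return antibody


# Toy standardized columns, similar in shape to AGE and Sex above
age = [-0.228, 4.38, 0.91, -1.1]
sex = [-1.92, 0.521, 0.521, 0.521]
random.seed(1)
print(create_mixed_antibody([age, sex]))
```

Detecting binary columns by counting distinct values is only a heuristic; passing an explicit list of binary column names would be more robust.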


**Fitness Function**

Requirements: Needs to be calculated fast because it runs over multiple iterations
Possibilities: - Binary Classification F1 Score, Mahalanobis Distance?
               - Other Types as well? : Linear Regression, Multilabel Classification

Do we just impute our values and then do something similar to StudentPerformance and see what happens? No, because we need input from the fitness function to drive our generations.

Is the data just our training set?
Inputs: Model (initialized outside the function or inside?) fit with data that has been encoded, and the label

Want to do k-fold CV (not every loop because it is very slow; once afterwards to evaluate)

When we do k-fold, call the fitness function k times, i.e. once for every train/test split.

If doing grid search, do it before calling this?
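
A sketch of the "call the fitness function once per train/test split" idea from the notes above, assuming macro F1 as the fitness score and NumPy arrays as input. `fold_fitness`, `kfold_fitness`, the decision tree model, and the synthetic dataset are hypothetical stand-ins, not the notebook's final design.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def fold_fitness(model, train_feat, test_feat, train_label, test_label):
    """Fitness for a single train/test split: macro-averaged F1."""
    model.fit(train_feat, train_label)
    predictions = model.predict(test_feat)
    return f1_score(test_label, predictions, average="macro")


def kfold_fitness(model, feat, label, n_splits=5):
    """Call fold_fitness once per split and return the list of scores."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_index, test_index in kf.split(feat):
        # Positional indexing works directly on NumPy arrays;
        # with DataFrames, use feat.iloc[train_index] instead
        scores.append(fold_fitness(model,
                                   feat[train_index], feat[test_index],
                                   label[train_index], label[test_index]))
    return scores


# Synthetic imbalanced data (roughly 80% / 20%) standing in for the real dataset
X, y = make_classification(n_samples=200, n_features=4,
                           weights=[0.8, 0.2], random_state=0)
scores = kfold_fitness(DecisionTreeClassifier(random_state=0), X, y)
print(scores)
```

For imbalanced labels, `StratifiedKFold` may be a better choice than `KFold`, since it keeps the class ratio roughly equal in every fold.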

In [70]:
# calculates the fitness score for one train/test split of the dataset
# run on the original dataset without random values first, to have a baseline to compare against


# def fitness(train_feat, test_feat, train_label, test_label, model):

#     model.fit(train_feat,train_label)
#     predictions = model.predict(test_feat) 

#     return f1_score(test_label, predictions, average='macro')

# def kfold_cv(n, feat, label):

#     kf = KFold(n_splits = n , random_state=None, shuffle=False)
#     for train_index, test_index in kf.split(df):
#         train_feat, test_feat = feat[train_index], feat[test_index]
#         train_label, test_label = label[train_index], label[test_index]

def fitness( model, feat, label, iterations, scorer):
    # iterations is passed to cross_val_score as cv, i.e. the number of cross-validation folds
    # scorer is either the name of a scikit-learn scoring metric (e.g. 'recall_macro')
    # or a callable with signature scorer(model, feature, label) that returns a single value.
    # Returns an array with one score per fold.
    return cross_val_score(model, feat, label, cv = iterations, scoring = scorer)

def distance( x, y, metric):

    # get the pairwise distances between two sets of data x and y; they must have the same number of features (columns)
    # metric is the string name of the metric used to measure distance
    # returns a matrix with one row per row of x and one column per row of y

    dist = DistanceMetric.get_metric(metric)
    return dist.pairwise(x,y)

In [72]:



randomForest = RandomForestClassifier()
#randomForest = randomForest.fit(df,labels)
clf = svm.SVC(random_state=0)

fitness(clf, df, labels.values.ravel(), 5, 'recall_macro')


array([0.65982906, 0.65413105, 0.68304843, 0.68490028, 0.67054264])