# Learning by Genetic Algorithm

I implemented three versions of a genetic algorithm that classifies stocks as 'Buy'. In every version, the fitness function runs a backtest on the training data, which spans 2005 through August 2018. Fitness is a weighted average of the overall return and the return in each individual year.
The algorithm has no mating (crossover) mechanism: the best chromosomes each produce several offspring on their own, and mutation is the only source of genetic change.
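
The selection-and-mutation loop described above can be sketched on a toy problem. This is a minimal illustration, not the notebook's implementation: `evolve`, `toy_fitness`, and all parameter values are hypothetical, and the toy fitness stands in for the backtest.

```python
import random

def evolve(fitness, n_genes=3, population_size=12, top_n=3,
           generations=30, mutation_strength=0.2, seed=0):
    """Mutation-only evolution: keep the fittest, clone them with noise."""
    rng = random.Random(seed)
    population = [[rng.uniform(-10, 10) for _ in range(n_genes)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Selection: the top_n fittest chromosomes become parents.
        parents = sorted(population, key=fitness, reverse=True)[:top_n]
        if fitness(parents[0]) > fitness(best):
            best = parents[0]

        # Reproduction without crossover: every child is a copy of one parent
        # with Gaussian noise added to each gene, clipped to [-10, 10].
        population = [
            [min(10.0, max(-10.0, gene + rng.gauss(0, mutation_strength)))
             for gene in parent]
            for parent in parents
            for _ in range(population_size // top_n)
        ]

    # The last generation of children has not been ranked yet.
    last_best = max(population, key=fitness)
    return last_best if fitness(last_best) > fitness(best) else best


def toy_fitness(genes):
    # Hypothetical stand-in for the backtest: highest when every gene equals 1.
    return -sum((gene - 1.0) ** 2 for gene in genes)


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```

As in the notebook's `GenAlg`, parents are not copied into the next generation, so the best chromosome seen so far has to be tracked separately.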

Algorithm versions:
1. A version similar to logistic regression. The chromosome holds one weight per predictor (each between -10 and 10), and a stock is classified as 'Buy' if the weighted sum of its predictors exceeds 10.
2. The chromosome holds one baseline per predictor. A stock is classified as 'Buy' only if every predictor is at or above its baseline; a single predictor below its baseline rules the stock out.
3. The chromosome again holds baselines, but they act only as a filter. A logistic regression is fit on the stocks that meet all baselines, and its predictions decide which of them are classified as 'Buy'.
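
The three decision rules can be compared side by side on a few made-up rows. The sketch below assumes standardized predictors; the arrays `x`, `weights`, `baselines`, and `model_output` are hypothetical values chosen for illustration, and `model_output` is a placeholder for what a fitted classifier would return.

```python
import numpy as np

# Toy standardized predictors: 4 stocks x 3 predictors (made-up numbers).
x = np.array([
    [ 2.0,  1.5, 0.5],
    [ 0.2, -0.1, 0.3],
    [-1.0,  2.0, 1.0],
    [ 1.0,  1.0, 3.0],
])

# Version 1: weighted sum of predictors compared with the fixed threshold 10.
weights = np.array([4.0, 2.0, 1.0])
buy_v1 = x @ weights > 10

# Version 2: every predictor must be at or above its baseline.
baselines = np.array([0.0, -0.5, 0.2])
buy_v2 = np.all(x >= baselines, axis=1)

# Version 3: the baselines only select candidates; a classifier then decides
# among them. model_output is a placeholder for that classifier's predictions.
passes_filter = buy_v2
model_output = np.array([1, 0, 1])      # one value per row that passed
buy_v3 = np.zeros(x.shape[0], dtype=int)
buy_v3[passes_filter] = model_output    # rows that failed the filter stay 0

print(buy_v1, buy_v2, buy_v3)
```

In version 3, a row can pass every baseline and still end up as a non-buy, which is the only difference from version 2.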

## Imports & loading the data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('stocks_data4.csv')
data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ticker,Year,Month,Price,MA Ratio,Buy,Result,ROE,ROA,ROI,Insider Ownership Growth,Institutional Ownership Growth,Forecast EPS Growth,Avg 2Q EPS Growth,Avg 2Q EPS Surprise,YoY EPS Growth,Sector Performance,Market Performance,Benchmark SP500 Performance
count,61517.0,61517,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0
unique,,417,,,,,,,,,,,,,,,,,,
top,,CPB,,,,,,,,,,,,,,,,,,
freq,,225,,,,,,,,,,,,,,,,,,
mean,30758.0,,2015.561064,6.475576,86.090852,1.005282,0.454102,1.035127,0.262087,0.08961,0.15072,0.031192,0.025757,0.090401,0.149241,11.124411,3969275000000.0,1.641091,1.603703,1.023901
std,17758.572592,,5.130421,3.448068,145.204509,0.043844,0.497893,0.139926,8.540118,1.104599,1.285496,0.891108,0.250149,1.732801,1.515826,52.925722,430114400000000.0,7.192112,6.071008,0.072566
min,0.0,,2005.0,1.0,0.17,0.580721,0.0,0.110349,-347.69357,-1.36977,-15.3364,-0.994779,-0.930676,-0.992366,-58.668103,-93.235,-1.0,-49.501466,-24.778692,0.690014
25%,15379.0,,2012.0,3.0,26.35,0.982233,0.0,0.955759,0.09591,0.03754,0.06528,-0.003494,-0.020228,-0.156716,-0.036162,1.27,0.01214575,-2.055089,-1.243019,0.989798
50%,30758.0,,2016.0,6.0,49.24,1.006824,0.0,1.035871,0.1664,0.07018,0.11467,0.0,0.00079,-0.016129,0.047591,4.74,0.1157895,2.177343,2.256661,1.034544
75%,46137.0,,2020.0,9.0,95.34,1.029836,1.0,1.113424,0.26634,0.11323,0.18696,0.008696,0.027434,0.11,0.157784,10.68,0.2425068,5.800866,5.36785,1.066909


## Split the dataset for testing and training

In [3]:
cut_off_year = 2019

data = data.reset_index(drop=True)
# Train on rows before September of the year preceding the cut-off; months 9-12
# of that year fall in neither set, which leaves a gap between train and test.
train_data = data[(data['Year'] < cut_off_year) & ((data['Year'] != cut_off_year - 1) | (data['Month'] < 9))].copy()
test_data = data[data['Year'] >= cut_off_year].copy()
# Columns that are not used as predictors.
drop_columns = ['Year', 'Buy', 'Month', 'Ticker', 'Result',
                'Benchmark SP500 Performance', 'Price', data.columns[0]]
x_train = train_data.drop(drop_columns, axis=1)
y_train = train_data['Buy']
x_test = test_data.drop(drop_columns, axis=1)
y_test = test_data['Buy']

print(f"Amount of train data: {len(train_data)}")
print(f"Amount of test data: {len(test_data)}")

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)

x_test = scaler.transform(x_test)

Amount of train data: 40199
Amount of test data: 19829
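
The two masks in the cell above do not cover every row: 61,517 - 40,199 - 19,829 = 1,489 rows, those from September to December 2018, land in neither set. A small sketch of the same rule for a single (year, month) pair makes that gap explicit. The function name `assign_split` and the label `'gap'` are hypothetical.

```python
CUT_OFF_YEAR = 2019

def assign_split(year, month, cut_off_year=CUT_OFF_YEAR):
    """Return 'train', 'test', or 'gap' for one (year, month) row."""
    if year >= cut_off_year:
        return "test"
    if year == cut_off_year - 1 and month >= 9:
        return "gap"  # excluded from both the training and the test set
    return "train"

rows = [(2005, 1), (2018, 8), (2018, 9), (2018, 12), (2019, 1), (2024, 8)]
splits = [assign_split(year, month) for year, month in rows]
print(splits)
```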


## Define fitness function (backtest)

In [60]:
def sell_stock(ticker, price, backtest_status):
    """Close the position in `ticker` at `price` and return the proceeds to cash."""
    prev_price = backtest_status['current_buys'][ticker]['price']
    amount = backtest_status['current_buys'][ticker]['shares']

    # portfolio_worth changes only by the realized profit or loss.
    backtest_status['portfolio_worth'] -= prev_price * amount
    backtest_status['portfolio_worth'] += price * amount
    backtest_status['available_cash'] += price * amount
    
def backtest(backtest_data, backtest_status):
    """Replay the rows in chronological order and update backtest_status in place."""
    prev_year = None
    for index, row in backtest_data.iterrows():
        # On a year change, record a net worth snapshot in which open
        # positions are valued at their last seen price.
        if row['Year'] != prev_year and prev_year is not None:
            net_worth = backtest_status['portfolio_worth']
            for ticker in backtest_status['current_buys']:
                price = backtest_status['current_buys'][ticker]['last_price']
                prev_price = backtest_status['current_buys'][ticker]['price']
                amount = backtest_status['current_buys'][ticker]['shares']

                net_worth -= prev_price * amount
                net_worth += price * amount

            backtest_status['portfolio_worths_each_year'].append(net_worth)
        
        prev_year = row['Year']
        ticker = row['Ticker']
        prediction = row['Predicted_Buy']
        price = row['Price']
        if prediction == True:
            if ticker not in backtest_status['current_buys']:
                allowed_spend = int(backtest_status['portfolio_worth'] / 5)
                
                if allowed_spend > backtest_status['available_cash']:
                    allowed_spend = backtest_status['available_cash']
    
                if allowed_spend < backtest_status['portfolio_worth'] / 50:
                    continue
                    
                amount = int(allowed_spend / price)
                backtest_status['available_cash'] -= amount * price
                
                backtest_status['current_buys'][ticker] = {'price': price, 'shares': amount, 'last_price': price}
                # print(f"Added {ticker} to current_buys for {row['Year']}-{row['Month']} with price {price}")
            else:
                if price < backtest_status['current_buys'][ticker]['price'] * 0.95: # Stop loss
                    prev_price = backtest_status['current_buys'][ticker]['price']
                    
                    sell_stock(ticker, price, backtest_status)
                    del backtest_status['current_buys'][ticker]
                    # print(f"Removed {ticker} from current_buys for {row['Year']}-{row['Month']} with price {price}; prev price: {prev_price}")
                    # print(f"New net worth: {backtest_status['portfolio_worth']}")
                    
                else:
                    backtest_status['current_buys'][ticker]['last_price'] = price

        else:
            if ticker in backtest_status['current_buys']:
                prev_price = backtest_status['current_buys'][ticker]['price']
    
                sell_stock(ticker, price, backtest_status)            
                del backtest_status['current_buys'][ticker]
                # print(f"Removed {ticker} from current_buys for {row['Year']}-{row['Month']} with price {price}; prev price: {prev_price}")
                # print(f"New net worth: {backtest_status['portfolio_worth']}")
    
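    # Liquidate whatever is still held at its last seen price. Entries are not
    # removed from current_buys, so they still show up when the status is printed.
    for ticker in backtest_status['current_buys']:
        sell_stock(ticker, backtest_status['current_buys'][ticker]['last_price'], backtest_status)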
        # print(f"Removed {ticker} from current_buys")
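
The position-sizing and stop-loss rules buried in `backtest` can be isolated as two small functions. This is a sketch of the same arithmetic; the helper names `shares_to_buy` and `stop_loss_hit` are hypothetical and do not appear in the notebook.

```python
def shares_to_buy(portfolio_worth, available_cash, price):
    """Shares for a new position under the sizing rule, or 0 to skip the buy."""
    allowed_spend = int(portfolio_worth / 5)            # aim for 20% of portfolio worth
    allowed_spend = min(allowed_spend, available_cash)  # but never more than the cash on hand
    if allowed_spend < portfolio_worth / 50:            # skip positions below 2% of portfolio worth
        return 0
    return int(allowed_spend / price)                   # whole shares only


def stop_loss_hit(entry_price, price):
    """True when the price has fallen more than 5% below the entry price."""
    return price < entry_price * 0.95


print(shares_to_buy(1_000_000, 1_000_000, 43.0))  # full 20% allocation
print(shares_to_buy(1_000_000, 150_000, 50.0))    # limited by available cash
print(shares_to_buy(1_000_000, 15_000, 50.0))     # too little cash, buy skipped
```

Because each position targets a fifth of the portfolio, at most about five full-size positions can be open at once; later buys are smaller or skipped.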

In [5]:
def calculate_fitness(predictions):
    train_data['Predicted_Buy'] = predictions
    backtest_data = train_data.sort_values(by=['Year', 'Month'])

    backtest_status = {
        'portfolio_worths_each_year': [],
        'available_cash': 1000000,
        'portfolio_worth': 1000000,
        'current_buys': {}
    }

    backtest(backtest_data, backtest_status)

    # Calculate weighted average as fitness score
    yearly_results = []
    for i in range(1, len(backtest_status['portfolio_worths_each_year'])):
        previous_year = backtest_status['portfolio_worths_each_year'][i - 1]
        current_year = backtest_status['portfolio_worths_each_year'][i]
        
        growth = ((current_year - previous_year) / previous_year) * 100
        yearly_results.append(growth)
    
    overall_result = ((backtest_status['portfolio_worth'] - 1000000) / 1000000) * 100
    overall_result_weight = 0.4
    
    num_results = len(yearly_results)
    
    # Rank-based weights: the worst year gets the smallest raw weight (0.3)
    # and the best year the largest (0.5).
    results_weights = np.linspace(0.3, 0.5, num_results)
    results_weights = results_weights[np.argsort(np.argsort(yearly_results))]
    
    # Rescale so that the yearly weights sum to 1 - overall_result_weight.
    results_weights = results_weights * (1 - overall_result_weight) / sum(results_weights)

    weighted_average = overall_result_weight * overall_result + sum(rw * yr for rw, yr in zip(results_weights, yearly_results))

    return weighted_average
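
A worked example may make the weighting in `calculate_fitness` easier to follow. The yearly and overall growth figures below are made up; only the weighting steps mirror the function above.

```python
import numpy as np

yearly_results = [12.0, -5.0, 30.0, 8.0]  # made-up yearly growth, in percent
overall_result = 50.0                     # made-up overall growth, in percent
overall_result_weight = 0.4

# argsort applied twice turns values into ranks: 0 for the worst year,
# n - 1 for the best. Here: [2, 0, 3, 1].
ranks = np.argsort(np.argsort(yearly_results))

# Raw weights rise linearly with rank from 0.3 (worst) to 0.5 (best).
raw_weights = np.linspace(0.3, 0.5, len(yearly_results))[ranks]

# Rescale so the yearly weights share the remaining 60% of the total weight.
weights = raw_weights * (1 - overall_result_weight) / raw_weights.sum()

fitness = overall_result_weight * overall_result + float(np.dot(weights, yearly_results))
print(ranks, weights, fitness)
```

With these numbers the weights come out as 0.1625, 0.1125, 0.1875 and 0.1375, and the fitness is about 28.11. The best year carries the largest weight, so the yearly term tilts toward strong years rather than penalizing weak ones.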

## Define backtest on the test data

In [38]:
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report
def backtest_test_data(chromosome):
    test_results = test_data.copy()
    backtest_status = {
        'portfolio_worths_each_year': [],
        'available_cash': 1000000,
        'portfolio_worth': 1000000,
        'current_buys': {}
    }
    test_results['Predicted_Buy'] = chromosome.calculate_predictions(x_test, y_test)
    backtest_data = test_results.sort_values(by=['Year', 'Month'])
    
    backtest(backtest_data, backtest_status)
    
    print(backtest_status)

## Define Genetic Algorithm

In [7]:
class GenAlg:
    def __init__(self, population_size=120, generations=20):
        self.population_size = population_size
        self.chromosomes = [Chromosome() for _ in range(population_size)]
        self.generations = generations
        self.current_best = self.chromosomes[0]

    def run(self):
        for i in range(self.generations):
            print(f"Generation #{i+1}")
            self.create_new_generation()

    def create_new_generation(self):
        best_chromosomes = self.select_best_chromosomes()
        self.chromosomes = []
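        # Each selected parent contributes an equal number of mutated children;
        # the parents themselves are not carried over.
        for parent in best_chromosomes:
            for _ in range(self.population_size // len(best_chromosomes)):
                self.chromosomes.append(parent.create_child())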
                
        self.population_size = len(self.chromosomes)
                
    def select_best_chromosomes(self, top_n=30):
        fitness_scores = []
        for chromosome in self.chromosomes:
            fitness = chromosome.calculate_fitness()
            fitness_scores.append((chromosome, fitness))
        
        sorted_chromosomes = sorted(fitness_scores, key=lambda x: x[1], reverse=True)
        
        best_chromosomes = [chromosome for chromosome, fitness in sorted_chromosomes[:top_n]]

        if best_chromosomes[0].fitness > self.current_best.fitness:
            self.current_best = best_chromosomes[0]
        
        return best_chromosomes

## First Chromosome Version - Predictor Weights

In [11]:
class Chromosome:
    def __init__(self, weights = None, mutation_strength=0.2):
        if weights is None:
            self.weights = np.random.uniform(-10, 10, size=x_train.shape[1])
        else:
            self.weights = weights
        
        self.mutation_strength = mutation_strength
        self.fitness = 0

    def calculate_predictions(self, x_, y_=None):
        # y_ is unused here; it is optional so that calculate_fitness can call
        # this method with x_train alone.
        weighted_sums = np.dot(x_, self.weights)
        predictions = np.zeros(x_.shape[0])
        predictions[weighted_sums > 10] = 1

        return predictions

    def calculate_fitness(self):
        pred = self.calculate_predictions(x_train)
        self.fitness = calculate_fitness(pred)

        return self.fitness

    def create_child(self):
        new_weights = self.weights + np.random.normal(0, self.mutation_strength, size=len(self.weights))
        new_weights = np.clip(new_weights, -10, 10)
        return Chromosome(new_weights)

#### Run the algorithm

In [9]:
genAlg = GenAlg()
genAlg.run()

Generation #1
Generation #2
Generation #3
Generation #4
Generation #5
Generation #6
Generation #7
Generation #8
Generation #9
Generation #10
Generation #11
Generation #12
Generation #13
Generation #14
Generation #15
Generation #16
Generation #17
Generation #18
Generation #19
Generation #20


#### Test some of the best chromosomes on the test data

In [10]:
best_chrs = genAlg.select_best_chromosomes()
backtest_test_data(genAlg.current_best)
for i in range(10):
    backtest_test_data(best_chrs[i])

{'portfolio_worths_each_year': [1069502.6699999995, 1075332.7799999993, 1478987.199999999, 1537033.7399999984, 1679570.6299999987], 'available_cash': 1815229.8800000004, 'portfolio_worth': 1815229.879999999, 'current_buys': {'VTR': {'price': 30.23, 'shares': 6515, 'last_price': 49.6}, 'RL': {'price': 110.34, 'shares': 1959, 'last_price': 106.58}, 'MRNA': {'price': 178.99, 'shares': 182, 'last_price': 174.3}, 'LLY': {'price': 581.27, 'shares': 157, 'last_price': 774.39}, 'PARA': {'price': 10.85, 'shares': 3590, 'last_price': 12.2}, 'PODD': {'price': 168.1, 'shares': 1917, 'last_price': 170.27}, 'IT': {'price': 422.39, 'shares': 698, 'last_price': 422.39}, 'META': {'price': 438.75, 'shares': 648, 'last_price': 438.75}, 'WY': {'price': 30.0, 'shares': 6021, 'last_price': 30.0}}}
{'portfolio_worths_each_year': [1160682.7000000004, 1076325.1099999999, 1852239.46, 1856212.0499999996, 1663368.0200000005], 'available_cash': 1772650.2199999997, 'portfolio_worth': 1772650.2200000007, 'current_bu

The test data for the backtest covers 2019 through August 2024. Over that period the S&P 500 returned approximately 105%, so an initial investment of one million dollars would have grown to about 2.05 million. Few of the chromosomes beat that benchmark in the backtest; most returned between 60% and 80%. The best performer, which ranked second by fitness score in the last generation, returned 156%. Here are the predictor weights for some of the top-performing chromosomes:

In [12]:
print(genAlg.current_best.weights)
for i in range(5):
    print('\n')
    print(best_chrs[i].weights)

[-7.90708456  4.29737714  5.93671314  1.91524808 -7.72473395 -0.78799237
  5.25908675  5.70905357  8.4275202   8.5122107   3.7897766  -1.25851244]


[-8.82306839  4.24722501  6.05406064  5.11662098 -5.48591707 -0.98062678
  6.58427808  7.19877335  9.00370521  6.84222354  3.29485937 -0.99709381]


[-7.77944686 -2.47792749  5.31608763 -3.73183824  7.40215546  2.5009884
  0.49561469  2.69921535 -1.4655852   4.61259733 -1.93421018  5.02073568]


[-7.87187444  4.43637749  6.22118834  2.10846312 -6.63180184 -1.06948285
  4.97875926  5.39788678  8.91613092  8.09344097  3.45016851 -1.05172929]


[-8.03661089  4.08112127  5.50853169  1.78640004 -7.50359675 -1.01872008
  5.16695851  5.62206328  8.73651952  7.86294062  3.98414527 -1.81771959]


[-8.01046623  4.36177775  6.51644934  1.60453504 -6.99615608 -0.72716305
  5.22311003  5.66541314  8.70363932  8.12660836  3.3223385  -0.77116224]
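
To relate the printed backtest dictionaries to the percentage returns quoted in this section, the final `portfolio_worth` can be converted into a total return and compared with the benchmark. The helper `total_return_pct` is hypothetical; the two final worths are rounded from the output shown above, and the benchmark figure is the approximate 105% quoted in the text.

```python
INITIAL_CAPITAL = 1_000_000

def total_return_pct(final_worth, initial=INITIAL_CAPITAL):
    """Total return in percent over the whole backtest."""
    return (final_worth - initial) / initial * 100

# Final portfolio_worth of the first two chromosomes printed above (rounded).
final_worths = [1815229.88, 1772650.22]
benchmark_return = 105.0  # approximate S&P 500 return quoted in the text

returns = [total_return_pct(worth) for worth in final_worths]
beats_benchmark = [r > benchmark_return for r in returns]

for r, beat in zip(returns, beats_benchmark):
    print(f"{r:.2f}% (beats benchmark: {beat})")
```

Both chromosomes shown here return roughly 82% and 77%, below the benchmark.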


## Second Chromosome Version - Predictor Baselines

In [13]:
class Chromosome:
    def __init__(self, baselines=None, mutation_strength=0.05):
        if baselines is None:
            self.baselines = np.random.uniform(-3, 0.5, size=x_train.shape[1])
        else:
            self.baselines = baselines

        self.mutation_strength = mutation_strength
        self.fitness = 0

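    def calculate_predictions(self, x_, y_=None):
        # y_ is unused here; it is optional so that calculate_fitness can call
        # this method with x_train alone.
        predictions = np.all(x_ >= self.baselines, axis=1).astype(int)
        
        return predictions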

    def calculate_fitness(self):
        pred = self.calculate_predictions(x_train)
        self.fitness = calculate_fitness(pred)

        return self.fitness

    def create_child(self):
        new_baselines = self.baselines + np.random.normal(0, self.mutation_strength, size=len(self.baselines))
        new_baselines = np.clip(new_baselines, -10, 2)
        return Chromosome(new_baselines)

#### Run the algorithm

In [14]:
genAlg = GenAlg()
genAlg.run()

Generation #1
Generation #2
Generation #3
Generation #4
Generation #5
Generation #6
Generation #7
Generation #8
Generation #9
Generation #10
Generation #11
Generation #12
Generation #13
Generation #14
Generation #15
Generation #16
Generation #17
Generation #18
Generation #19
Generation #20


#### Test some of the best chromosomes on the test data

In [15]:
best_chrs = genAlg.select_best_chromosomes()
backtest_test_data(genAlg.current_best)
for i in range(10):
    backtest_test_data(best_chrs[i])

{'portfolio_worths_each_year': [1202073.1300000006, 1463094.6600000015, 1551798.8000000021, 1748824.6400000025, 2142867.9600000028], 'available_cash': 2442413.1400000006, 'portfolio_worth': 2442413.1400000025, 'current_buys': {'IPG': {'price': 16.49, 'shares': 12747, 'last_price': 32.97}, 'PHM': {'price': 51.78, 'shares': 1802, 'last_price': 51.78}, 'CTVA': {'price': 53.18, 'shares': 695, 'last_price': 55.24}, 'AEE': {'price': 70.0, 'shares': 5298, 'last_price': 73.79}, 'KKR': {'price': 98.61, 'shares': 4010, 'last_price': 94.81}, 'LLY': {'price': 779.74, 'shares': 160, 'last_price': 774.39}, 'ED': {'price': 88.83, 'shares': 2293, 'last_price': 93.23}, 'ES': {'price': 60.9, 'shares': 6842, 'last_price': 60.9}, 'KR': {'price': 54.16, 'shares': 6738, 'last_price': 54.16}}}
{'portfolio_worths_each_year': [1275373.2599999998, 1591098.4700000004, 1628680.1600000015, 1628663.180000002, 1870057.1300000022], 'available_cash': 1837662.1800000002, 'portfolio_worth': 1837662.180000002, 'current_b

This version performed slightly better on average in the backtests. Most chromosomes returned more than 90%, and the best one returned 145%. Here are the predictor baselines for some of the top-performing chromosomes:

In [16]:
print(genAlg.current_best.baselines)
for i in range(5):
    print('\n')
    print(best_chrs[i].baselines)

[-1.62643514 -0.72140548 -2.09943423 -2.08296282 -0.71007717 -0.35751176
 -0.12343241  0.01446743 -0.18718315 -1.65394075 -2.36479816 -2.45316202]


[-2.82190461 -1.53258433 -2.0311165  -2.90928006 -1.97445271 -2.59073975
  0.23496754 -0.56002061 -1.22981771 -1.69195768 -2.16818645 -1.42213133]


[-3.44497627 -1.64598587 -1.91211296 -3.00708738 -1.83583169 -2.36205559
  0.31016912 -0.91901885 -1.3024518  -1.71549092 -2.65703525 -1.42436897]


[-2.80928167 -1.89030879 -1.88329041 -2.65257293 -1.6988751  -2.27566413
  0.31620656 -0.8143083  -1.50728974 -1.7939793  -2.45640519 -1.38971403]


[-3.30371217 -1.59219441 -1.85556449 -3.0137365  -1.89730404 -2.36616799
  0.32112322 -1.05083913 -1.24276364 -1.67315729 -2.60524448 -1.39622892]


[-3.16116706 -1.72783171 -1.735197   -2.7086135  -2.02517779 -2.66475322
  0.29941901 -0.98892094 -1.35066328 -1.3865719  -2.35729334 -1.48733494]


## Third Chromosome Version - Predictor Baselines + Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression
class Chromosome:
    def __init__(self, baselines=None, mutation_strength=0.08):
        if baselines is None:
            self.baselines = np.random.uniform(-5, 0.5, size=x_train.shape[1])
        else:
            self.baselines = baselines

        self.mutation_strength = mutation_strength
        self.fitness = 0

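    def logistic_regression_predict(self, x_, y_):
        # The model is fit on the filtered rows and then predicts those same rows.
        model = LogisticRegression(max_iter=100)
        model.fit(x_, y_)
        return model.predict(x_)
        
    def calculate_predictions(self, x_, y_=y_train):
        # y_ must hold the labels for the rows of x_; backtest_test_data passes y_test.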
        selected_rows_mask = np.all(x_ >= self.baselines, axis=1)
        selected_rows = x_[selected_rows_mask]
        y_selected = y_[selected_rows_mask]

        predictions = np.zeros(x_.shape[0], dtype=int)
        
        if selected_rows.shape[0] > 0 and len(np.unique(y_selected)) == 2:
            logistic_predictions = self.logistic_regression_predict(selected_rows, y_selected)
            predictions[selected_rows_mask] = logistic_predictions
            return predictions

        # Nothing passed the filter, or the filtered rows contain a single class.
        return None

    def calculate_fitness(self):
        self.fitness = 0
        pred = self.calculate_predictions(x_train)
        
        if pred is not None:
            self.fitness = calculate_fitness(pred)

        return self.fitness

    def create_child(self):
        new_baselines = self.baselines + np.random.normal(0, self.mutation_strength, size=len(self.baselines))
        new_baselines = np.clip(new_baselines, -10, 3)
        return Chromosome(new_baselines)

#### Run the algorithm

In [36]:
genAlg = GenAlg()
genAlg.run()

Generation #1
Generation #2
Generation #3
Generation #4
Generation #5
Generation #6
Generation #7
Generation #8
Generation #9
Generation #10
Generation #11
Generation #12
Generation #13
Generation #14
Generation #15
Generation #16
Generation #17
Generation #18
Generation #19
Generation #20


#### Test some of the best chromosomes on the test data

In [39]:
best_chrs = genAlg.select_best_chromosomes()
backtest_test_data(genAlg.current_best)
for i in range(10):
    backtest_test_data(best_chrs[i])

{'portfolio_worths_each_year': [992394.74, 1117605.19, 1232419.6099999999, 1333577.65, 1321355.9999999998], 'available_cash': 1382029.57, 'portfolio_worth': 1382029.5699999994, 'current_buys': {'CF': {'price': 43.0, 'shares': 6013, 'last_price': 43.0}}}
{'portfolio_worths_each_year': [1087413.2899999998, 1216091.97, 1408432.2799999998, 1605226.9300000002, 1634975.68], 'available_cash': 1744087.17, 'portfolio_worth': 1744087.17, 'current_buys': {}}
{'portfolio_worths_each_year': [1077276.7299999997, 1243916.4499999997, 1423071.7399999993, 1621912.0899999992, 1646113.869999999], 'available_cash': 1757560.2599999993, 'portfolio_worth': 1757560.2599999986, 'current_buys': {}}
{'portfolio_worths_each_year': [1093535.2500000005, 1223855.5900000005, 1461838.5300000003, 1662186.0600000008, 1709729.0900000012], 'available_cash': 1857121.3700000006, 'portfolio_worth': 1857121.3700000013, 'current_buys': {}}
{'portfolio_worths_each_year': [1118546.81, 1242553.07, 1419298.8000000003, 1584204.4, 16

## Conclusions

In every version, most chromosomes performed worse than the S&P 500. The likely cause is the small number of generations and the small population size. I kept both low to avoid lengthy training: each fitness evaluation runs a backtest over a large amount of data, which makes training slow. The second version (baselines only) gave the most promising results because it appears to be the most stable: a significant portion of its best chromosomes came close to the S&P 500, and several exceeded it. Overall, this genetic algorithm produces weaker backtest results than logistic regression, but its results may hold up more reliably in the future.