# Random Forest Classification
This program runs a Random Forest classifier to predict whether a stock is a good option to buy. A stock is classified as 'Buy' if it beats the S&P 500 and its ROI is above 2%.
Randomized search is used to tune the hyperparameters for the model's precision.
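
The labeling rule above can be sketched as a small function. This is a minimal illustration, not the code used to build the dataset: it assumes that returns are expressed as growth ratios (1.03 meaning +3%), as the `Result` and `Benchmark SP500 Performance` columns later in this notebook suggest. The function name `is_buy` and its parameters are hypothetical.

```python
def is_buy(stock_return, sp500_return, min_roi=0.02):
    """Return 1 if the stock beat the S&P 500 and gained more than min_roi, else 0.

    Assumption: both returns are growth ratios over the same period (1.05 means +5%).
    """
    beats_benchmark = stock_return > sp500_return
    roi_above_threshold = (stock_return - 1.0) > min_roi
    return int(beats_benchmark and roi_above_threshold)


# A stock up 6% while the index is up 3% is a 'Buy'.
print(is_buy(1.06, 1.03))
# A stock up 1% beats a falling index but misses the 2% ROI threshold.
print(is_buy(1.01, 0.98))
```

Both conditions must hold, so a stock that beats a falling index without clearing the 2% threshold is not labeled 'Buy'.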

### 1. Imports

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, make_scorer, recall_score
from sklearn.preprocessing import StandardScaler
from scipy.stats import randint

## First attempt (less data)

### 2. Load the data

In [2]:
data = pd.read_csv('stocks_data.csv')
data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ticker,Year,Month,MA Ratio,Buy,Result,ROE,Insider Ownership Growth,Institutional Ownership Growth,Forecast EPS Growth,Avg 2Q EPS Growth,Avg 2Q EPS Surprise,YoY EPS Growth,Sector Performance,Market Performance,Benchmark SP500 Performance
count,14854.0,14854,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0
unique,,393,,,,,,,,,,,,,,,
top,,AWK,,,,,,,,,,,,,,,
freq,,55,,,,,,,,,,,,,,,
mean,7426.5,,2020.669113,6.22324,1.004148,0.415848,1.032817,39.494365,0.015486,0.026708,0.057775,0.181477,13.755183,0.369529,1.488003,1.438443,1.029219
std,4288.124784,,1.428016,3.520757,0.046473,0.492884,0.14926,181.839873,0.269863,0.230675,2.136724,2.111809,46.751483,3.637998,8.164589,7.038394,0.077491
min,0.0,,2018.0,1.0,0.580721,0.0,0.259712,-613.743387,-0.633527,-0.714136,-0.992366,-45.05,-65.625,-0.961538,-44.900728,-22.795349,0.769903
25%,3713.25,,2019.0,3.0,0.977766,0.0,0.944153,10.160854,-0.00135,-0.023114,-0.184264,-0.040838,2.015,0.017606,-3.453784,-3.160007,0.983314
50%,7426.5,,2021.0,6.0,1.00536,0.0,1.028547,19.251991,0.0,-0.000648,-0.039062,0.045662,6.055,0.130688,1.496227,2.069271,1.043101
75%,11139.75,,2022.0,9.0,1.031953,1.0,1.113949,31.949569,0.008,0.033653,0.086957,0.154182,13.135,0.275148,6.429508,5.50743,1.080728


### 3. Split the data into train and test sets, standardize the data

In [4]:
data = data.reset_index(drop=True)

# Chronological split: train on years up to 2022, test on later years
train_data = data[data['Year'] <= 2022]
test_data = data[data['Year'] > 2022]

# Drop identifiers, date columns, the target, and the outcome columns
drop_cols = ['Year', 'Buy', 'Month', 'Ticker', 'Result', 'Benchmark SP500 Performance', data.columns[0]]
x_train = train_data.drop(drop_cols, axis=1)
y_train = train_data['Buy']
x_test = test_data.drop(drop_cols, axis=1)
y_test = test_data['Buy']

# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

### 4. Train the model with randomized search

In [39]:
param_dist = {
    'n_estimators': randint(100, 800),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'class_weight': [{0: 3, 1: 1}, {0: 5, 1: 2}, {0: 2, 1: 1}, {0: 8, 1: 5}, {0: 6, 1: 5}, {0: 5, 1: 6}, {0: 1, 1: 2}, None]
}
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20, 
                                   cv=5,
                                   scoring='precision',
                                   n_jobs=-1,
                                   verbose=2,
                                   random_state=42)
random_search.fit(x_train, y_train)
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)

y_pred = best_model.predict(x_test)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'class_weight': {0: 2, 1: 1}, 'max_depth': 31, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 443}


### 5. Evaluation

In [40]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 0.654434250764526
Precision: 0.4224137931034483
Confusion Matrix:
[[1021   67]
 [ 498   49]]
Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.94      0.78      1088
           1       0.42      0.09      0.15       547

    accuracy                           0.65      1635
   macro avg       0.55      0.51      0.47      1635
weighted avg       0.59      0.65      0.57      1635



In [41]:
test_results = test_data.copy()
test_results['Predicted_Buy'] = y_pred

predicted_stocks_to_buy = test_results[test_results['Predicted_Buy'] == 1]
predicted_stocks_avg_return = predicted_stocks_to_buy['Result'].mean()

avg_stock_return = test_results['Result'].mean()

best_stocks = test_results[test_results['Buy'] == 1]
avg_best_stocks_return = best_stocks['Result'].mean()

sp500_return = predicted_stocks_to_buy['Benchmark SP500 Performance'].mean()

print("Benchmarks: ")
print(f"Average stock return (whole test sample): {avg_stock_return:.5f}")
print(f"'Buy' stocks average return: {avg_best_stocks_return:.5f}")
print(f"SP500 return: {sp500_return:.5f}")

print(f"\nModel's predicted stock average return: {predicted_stocks_avg_return:.5f}")

Benchmarks: 
Average stock return (whole test sample): 1.03164
'Buy' stocks average return: 1.17566
SP500 return: 1.05889

Model's predicted stock average return: 1.05902


### 6. Conclusion

The model shows some promise, but there is clear room for improvement. Precision for the positive class (class 1) is 42.24%, above the 33.5% share of 'Buy' rows in the test set (547 of 1,635), so a predicted 'Buy' is more likely to be correct than a random pick. Recall, however, is only 9%: the model finds 49 of the 547 actual 'Buy' cases. The average return of the predicted stocks, 1.05902, is marginally above the S&P 500 return of 1.05889, which makes the model competitive with the benchmark, although a margin of 0.00013 is too small to count as a clear edge.

Improving recall for the positive class, through more data, feature engineering, or alternative modeling approaches, is the most obvious next step toward making the model a reliable basis for an investment strategy.
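
The precision and recall discussed in this conclusion follow directly from the confusion matrix printed in the evaluation step. As a quick check, the sketch below recomputes them in plain Python and adds the base rate of the positive class for comparison; the variable names are illustrative.

```python
# Confusion matrix from the evaluation step: rows are actual classes, columns are predictions.
conf_matrix = [[1021, 67],
               [498, 49]]

(tn, fp), (fn, tp) = conf_matrix

precision = tp / (tp + fp)                    # share of predicted 'Buy' rows that were correct
recall = tp / (tp + fn)                       # share of actual 'Buy' rows the model found
base_rate = (tp + fn) / (tn + fp + fn + tp)   # share of 'Buy' rows in the test set

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Base rate: {base_rate:.4f}")
```

Comparing precision with the base rate shows how much better a predicted 'Buy' is than picking a test row at random.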

## Second attempt (more data)
In this attempt, data with a broader date range (from around 2008 for most companies) was used. This range includes several recession periods, providing greater diversity for the model to learn from. Additionally, columns containing information on the companies' Return on Assets and Return on Invested Capital were added.

### 1. Read & preprocess the data

In [3]:
data = pd.read_csv('stocks_data4.csv')
data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ticker,Year,Month,Price,MA Ratio,Buy,Result,ROE,ROA,ROI,Insider Ownership Growth,Institutional Ownership Growth,Forecast EPS Growth,Avg 2Q EPS Growth,Avg 2Q EPS Surprise,YoY EPS Growth,Sector Performance,Market Performance,Benchmark SP500 Performance
count,61517.0,61517,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0,61517.0
unique,,417,,,,,,,,,,,,,,,,,,
top,,CPB,,,,,,,,,,,,,,,,,,
freq,,225,,,,,,,,,,,,,,,,,,
mean,30758.0,,2015.561064,6.475576,86.090852,1.005282,0.454102,1.035127,0.262087,0.08961,0.15072,0.031192,0.025757,0.090401,0.149241,11.124411,3969275000000.0,1.641091,1.603703,1.023901
std,17758.572592,,5.130421,3.448068,145.204509,0.043844,0.497893,0.139926,8.540118,1.104599,1.285496,0.891108,0.250149,1.732801,1.515826,52.925722,430114400000000.0,7.192112,6.071008,0.072566
min,0.0,,2005.0,1.0,0.17,0.580721,0.0,0.110349,-347.69357,-1.36977,-15.3364,-0.994779,-0.930676,-0.992366,-58.668103,-93.235,-1.0,-49.501466,-24.778692,0.690014
25%,15379.0,,2012.0,3.0,26.35,0.982233,0.0,0.955759,0.09591,0.03754,0.06528,-0.003494,-0.020228,-0.156716,-0.036162,1.27,0.01214575,-2.055089,-1.243019,0.989798
50%,30758.0,,2016.0,6.0,49.24,1.006824,0.0,1.035871,0.1664,0.07018,0.11467,0.0,0.00079,-0.016129,0.047591,4.74,0.1157895,2.177343,2.256661,1.034544
75%,46137.0,,2020.0,9.0,95.34,1.029836,1.0,1.113424,0.26634,0.11323,0.18696,0.008696,0.027434,0.11,0.157784,10.68,0.2425068,5.800866,5.36785,1.066909


### 2. Split the data and train the model

In [4]:
cut_off_year = 2019

data = data.reset_index(drop=True)

# Train on rows before cut_off_year, leaving out months 9-12 of the preceding year
# so that there is a gap between the train and test periods
train_data = data[(data['Year'] < cut_off_year) & ((data['Year'] != cut_off_year - 1) | (data['Month'] < 9))]
test_data = data[data['Year'] >= cut_off_year]

# Drop identifiers, date columns, the raw price, the target, and the outcome columns
drop_cols = ['Year', 'Buy', 'Month', 'Ticker', 'Result', 'Benchmark SP500 Performance', 'Price', data.columns[0]]
x_train = train_data.drop(drop_cols, axis=1)
y_train = train_data['Buy']
x_test = test_data.drop(drop_cols, axis=1)
y_test = test_data['Buy']

print(f"Amount of train data: {len(train_data)}")
print(f"Amount of test data: {len(test_data)}")

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Amount of train data: 40199
Amount of test data: 19829
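
The training mask in this cell combines two conditions: a row must fall before `cut_off_year`, and rows from the last year before the cut-off are kept only up to month 8. A plain-Python restatement of that rule is sketched below. The helper names `in_train` and `in_test` and the `gap_start_month` parameter are hypothetical, introduced only to make the rule easier to read. A likely motivation, not stated in the notebook, is that the label looks a few months ahead, so late-2018 rows would have outcomes that overlap the test period.

```python
def in_train(year, month, cut_off_year=2019, gap_start_month=9):
    """Mirror the notebook's training mask for a single (year, month) pair."""
    before_cut_off = year < cut_off_year
    outside_gap = (year != cut_off_year - 1) or (month < gap_start_month)
    return before_cut_off and outside_gap


def in_test(year, cut_off_year=2019):
    """Mirror the notebook's test mask."""
    return year >= cut_off_year


for year, month in [(2017, 12), (2018, 8), (2018, 9), (2019, 1)]:
    if in_train(year, month):
        subset = "train"
    elif in_test(year):
        subset = "test"
    else:
        subset = "dropped"
    print(year, month, subset)
```

This agrees with the row counts printed above: 40199 + 19829 = 60028 of the 61517 rows, so 1489 rows fall in the gap and are used in neither set.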


In [5]:
param_dist = {
    'n_estimators': randint(100, 800),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'class_weight': [{0: 3, 1: 1}, {0: 5, 1: 2}, {0: 2, 1: 1}, {0: 8, 1: 5}, {0: 6, 1: 5}, {0: 5, 1: 6}, {0: 1, 1: 2}, None]
}
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20, 
                                   cv=5,
                                   scoring='precision',
                                   n_jobs=-1,
                                   verbose=2,
                                   random_state=42)
random_search.fit(x_train, y_train)
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)

y_pred = best_model.predict(x_test)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'class_weight': {0: 2, 1: 1}, 'max_depth': 32, 'min_samples_leaf': 8, 'min_samples_split': 5, 'n_estimators': 763}


### 3. Evaluate the model

In [6]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 0.5931716173281557
Precision: 0.47116564417177914
Confusion Matrix:
[[11378   431]
 [ 7636   384]]
Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.96      0.74     11809
           1       0.47      0.05      0.09      8020

    accuracy                           0.59     19829
   macro avg       0.53      0.51      0.41     19829
weighted avg       0.55      0.59      0.47     19829



In [7]:
test_results = test_data.copy()
test_results['Predicted_Buy'] = y_pred

predicted_stocks_to_buy = test_results[test_results['Predicted_Buy'] == 1]
predicted_stocks_avg_return = predicted_stocks_to_buy['Result'].mean()

avg_stock_return = test_results['Result'].mean()

best_stocks = test_results[test_results['Buy'] == 1]
avg_best_stocks_return = best_stocks['Result'].mean()

sp500_return = predicted_stocks_to_buy['Benchmark SP500 Performance'].mean()

print("Benchmarks: ")
print(f"Average stock return (whole test sample): {avg_stock_return:.5f}")
print(f"'Buy' stocks average return: {avg_best_stocks_return:.5f}")
print(f"SP500 return: {sp500_return:.5f}")

print(f"\nModel's predicted stock average return: {predicted_stocks_avg_return:.5f}")

Benchmarks: 
Average stock return (whole test sample): 1.03403
'Buy' stocks average return: 1.16015
SP500 return: 1.03479

Model's predicted stock average return: 1.05352


### 4. Conclusions
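Although the stocks picked by this model have a lower average 3-month return than in the first attempt (1.05352 versus 1.05902), I believe this model is better because it beat its benchmark, the S&P 500 return over the same investment periods, by a much wider margin: 1.05352 versus 1.03479, compared with 1.05902 versus 1.05889 in the first attempt. Precision is still weak at 47.1%, though above the 40.4% share of 'Buy' rows in the test set, and recall is only 5%. Even so, I think a profitable investment strategy can be built from this model after refining other aspects.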
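
The comparison between the two attempts rests on excess return over the S&P 500 rather than on raw return. Using only the figures printed in the two evaluation cells, the arithmetic can be checked as follows (variable names are illustrative):

```python
# (model's predicted-stock average return, S&P 500 return over the same periods)
attempts = {
    "first attempt": (1.05902, 1.05889),
    "second attempt": (1.05352, 1.03479),
}

excess = {name: model - sp500 for name, (model, sp500) in attempts.items()}

for name, value in excess.items():
    # Returns are growth ratios, so a difference of 0.01 is one percentage point.
    print(f"{name}: excess return = {value * 100:.2f} percentage points")
```

The second attempt's margin over the benchmark is roughly 1.9 percentage points, against about 0.01 for the first.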

### 5. Simple example strategy - backtest
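
The backtest in the next cell sizes each new position with a simple rule: spend at most one fifth of the portfolio's worth, never more than the available cash, and skip the purchase when the amount would fall below one fiftieth of the portfolio. The sketch below isolates that rule and replays the first trade from the log (AAL bought at 34.19 and sold at 31.93). The function name `position_size` is hypothetical and not part of the notebook.

```python
def position_size(portfolio_worth, available_cash, price):
    """Return the number of whole shares to buy, or 0 if the position would be too small."""
    allowed_spend = int(portfolio_worth / 5)             # at most 20% of the portfolio
    allowed_spend = min(allowed_spend, available_cash)   # never more than the cash on hand
    if allowed_spend < portfolio_worth / 50:             # skip positions below 2%
        return 0
    return int(allowed_spend / price)                    # whole shares only


portfolio_worth = 1000000
shares = position_size(portfolio_worth, available_cash=1000000, price=34.19)

# Realized profit or loss when the position is closed at 31.93.
pnl = shares * (31.93 - 34.19)
print(shares, round(portfolio_worth + pnl, 2))
```

With the starting capital of 1,000,000 this gives 5849 shares and a net worth of 986781.26 after the sale, matching the first "New net worth" line in the backtest log.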

In [8]:
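available_cash = 1000000   # cash not currently invested
portfolio_worth = 1000000  # cash plus open positions valued at their purchase price
current_buys = {}          # open positions keyed by ticker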

backtest_data = test_results.copy()
backtest_data = backtest_data.sort_values(by=['Year', 'Month'])

def sell_stock(ticker, price):
    prev_price = current_buys[ticker]['price']
    amount = current_buys[ticker]['shares']

    global portfolio_worth, available_cash
    portfolio_worth -= prev_price * amount
    portfolio_worth += price * amount
    available_cash += price * amount

for index, row in backtest_data.iterrows():
    ticker = row['Ticker']
    prediction = row['Predicted_Buy']
    price = row['Price']

    if prediction == 1:  # the model predicts 'Buy' for this row
        if ticker not in current_buys:
            allowed_spend = int(portfolio_worth / 5)
            
            if allowed_spend > available_cash:
                allowed_spend = available_cash

            if allowed_spend < portfolio_worth / 50:
                continue
                
            amount = int(allowed_spend / price)
            available_cash -= amount * price
            
            current_buys[ticker] = {'price': price, 'shares': amount, 'last_price': price}
            print(f"Added {ticker} to current_buys for {row['Year']}-{row['Month']} with price {price}")
        else:
            if price < current_buys[ticker]['price'] * 0.95: # Stop loss
                prev_price = current_buys[ticker]['price']
                
                sell_stock(ticker, price)
                del current_buys[ticker]
                
                print(f"Removed {ticker} from current_buys for {row['Year']}-{row['Month']} with price {price}; prev price: {prev_price}")
                print(f"New net worth: {portfolio_worth}")
            else:
                current_buys[ticker]['last_price'] = price
            
    else:
        if ticker in current_buys:
            prev_price = current_buys[ticker]['price']

            sell_stock(ticker, price)            
            del current_buys[ticker]
            
            print(f"Removed {ticker} from current_buys for {row['Year']}-{row['Month']} with price {price}; prev price: {prev_price}")
            print(f"New net worth: {portfolio_worth}")

for ticker in current_buys:
    sell_stock(ticker, current_buys[ticker]['last_price'])
    
    print(f"Removed {ticker} from current_buys")
    
print(portfolio_worth)

Added AAL to current_buys for 2019-3 with price 34.19
Added BXP to current_buys for 2019-3 with price 102.66
Added DPZ to current_buys for 2019-3 with price 236.41
Added EXR to current_buys for 2019-3 with price 78.72
Added GEN to current_buys for 2019-3 with price 11.77
Removed AAL from current_buys for 2019-4 with price 31.93; prev price: 34.19
New net worth: 986781.26
Removed BXP from current_buys for 2019-4 with price 106.23; prev price: 102.66
New net worth: 993735.6200000001
Removed EXR from current_buys for 2019-4 with price 84.05; prev price: 78.72
New net worth: 1007273.8200000001
Removed GEN from current_buys for 2019-4 with price 12.12; prev price: 11.77
New net worth: 1013221.02
Added INTU to current_buys for 2019-4 with price 256.32
Added LULU to current_buys for 2019-4 with price 165.52
Added MAR to current_buys for 2019-4 with price 122.53
Added META to current_buys for 2019-4 with price 168.35
Removed DPZ from current_buys for 2019-5 with price 256.62; prev price: 236.4