# Random Forest Classification
This program runs a Random Forest classifier to predict whether a stock is a good option to buy. A stock is classified as 'Buy' if it beats the S&P 500 and its ROI is above 2%.
Randomized search is used to tune the model's hyperparameters for precision.
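
The labeling rule described above can be sketched as follows. This is only an illustration: it assumes `Result` and `Benchmark SP500 Performance` are gross return multiples (so 1.02 means +2%), which the notebook does not state explicitly, and the helper name `label_buy` is hypothetical.

```python
import pandas as pd

def label_buy(result: pd.Series, benchmark: pd.Series, min_roi: float = 0.02) -> pd.Series:
    """Return 1 where the stock beat the benchmark and gained more than min_roi."""
    beats_benchmark = result > benchmark
    clears_roi = result > 1.0 + min_roi  # assumption: returns are gross multiples
    return (beats_benchmark & clears_roi).astype(int)

# Toy rows with made-up values, not taken from stocks_data.csv
demo = pd.DataFrame({
    "Result": [1.10, 1.01, 1.05, 0.95],
    "Benchmark SP500 Performance": [1.04, 0.98, 1.08, 0.90],
})
demo["Buy"] = label_buy(demo["Result"], demo["Benchmark SP500 Performance"])
print(demo)
```

Only the first toy row satisfies both conditions; the others fail either the benchmark test or the 2% threshold.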

### 1. Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, make_scorer, recall_score
from sklearn.preprocessing import StandardScaler
from scipy.stats import randint

### 2. Load the data

In [2]:
data = pd.read_csv('stocks_data.csv')
data.describe(include='all')

| Statistic | Unnamed: 0 | Ticker | Year | Month | MA Ratio | Buy | Result | ROE | Insider Ownership Growth | Institutional Ownership Growth | Forecast EPS Growth | Avg 2Q EPS Growth | Avg 2Q EPS Surprise | YoY EPS Growth | Sector Performance | Market Performance | Benchmark SP500 Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14854.0 | 14854 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 | 14854.0 |
| unique | | 393 | | | | | | | | | | | | | | | |
| top | | AWK | | | | | | | | | | | | | | | |
| freq | | 55 | | | | | | | | | | | | | | | |
| mean | 7426.5 | | 2020.669113 | 6.22324 | 1.004148 | 0.415848 | 1.032817 | 39.494365 | 0.015486 | 0.026708 | 0.057775 | 0.181477 | 13.755183 | 0.369529 | 1.488003 | 1.438443 | 1.029219 |
| std | 4288.124784 | | 1.428016 | 3.520757 | 0.046473 | 0.492884 | 0.14926 | 181.839873 | 0.269863 | 0.230675 | 2.136724 | 2.111809 | 46.751483 | 3.637998 | 8.164589 | 7.038394 | 0.077491 |
| min | 0.0 | | 2018.0 | 1.0 | 0.580721 | 0.0 | 0.259712 | -613.743387 | -0.633527 | -0.714136 | -0.992366 | -45.05 | -65.625 | -0.961538 | -44.900728 | -22.795349 | 0.769903 |
| 25% | 3713.25 | | 2019.0 | 3.0 | 0.977766 | 0.0 | 0.944153 | 10.160854 | -0.00135 | -0.023114 | -0.184264 | -0.040838 | 2.015 | 0.017606 | -3.453784 | -3.160007 | 0.983314 |
| 50% | 7426.5 | | 2021.0 | 6.0 | 1.00536 | 0.0 | 1.028547 | 19.251991 | 0.0 | -0.000648 | -0.039062 | 0.045662 | 6.055 | 0.130688 | 1.496227 | 2.069271 | 1.043101 |
| 75% | 11139.75 | | 2022.0 | 9.0 | 1.031953 | 1.0 | 1.113949 | 31.949569 | 0.008 | 0.033653 | 0.086957 | 0.154182 | 13.135 | 0.275148 | 6.429508 | 5.50743 | 1.080728 |


### 3. Split the data into train and test sets, standardize the data

In [4]:
# Time-based split: train on data through 2022, test on later years
data = data.reset_index(drop=True)
train_data = data[data['Year'] <= 2022]
test_data = data[data['Year'] > 2022]

# Drop identifiers, the label, outcome columns, and the leftover index column
drop_cols = ['Year', 'Buy', 'Month', 'Ticker', 'Result', 'Benchmark SP500 Performance', data.columns[0]]
x_train = train_data.drop(columns=drop_cols)
y_train = train_data['Buy']
x_test = test_data.drop(columns=drop_cols)
y_test = test_data['Buy']

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
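
A quick sanity check on a chronological split like the one above might look like the following sketch. The helper `split_by_year` and the toy table are hypothetical stand-ins; the real notebook splits on `Year <= 2022` directly.

```python
import pandas as pd

def split_by_year(df: pd.DataFrame, last_train_year: int, year_col: str = "Year"):
    """Split rows chronologically: train up to and including last_train_year, test after it."""
    train = df[df[year_col] <= last_train_year]
    test = df[df[year_col] > last_train_year]
    return train, test

# Toy data standing in for the real table (values are made up)
toy = pd.DataFrame({
    "Year": [2021, 2022, 2022, 2023, 2023],
    "Buy": [1, 0, 1, 0, 1],
})
train, test = split_by_year(toy, last_train_year=2022)

# Every training row precedes every test row, and no row is lost
chronological = train["Year"].max() < test["Year"].min()
complete = len(train) + len(test) == len(toy)
print(chronological, complete)

# Share of 'Buy' labels in each part; a large gap hints at a shift between periods
print(train["Buy"].mean(), test["Buy"].mean())
```

Comparing the label share in the two parts is a cheap way to see whether the test period differs noticeably from the training period before any model is fitted.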

### 4. Train the model with randomized search

In [39]:
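# Search space: randint(low, high) samples integers from low to high - 1;
# the class_weight candidates shift the misclassification penalty between classes
param_dist = {
    'n_estimators': randint(100, 800),
    'max_depth': randint(10, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'class_weight': [{0: 3, 1: 1}, {0: 5, 1: 2}, {0: 2, 1: 1}, {0: 8, 1: 5}, {0: 6, 1: 5}, {0: 5, 1: 6}, {0: 1, 1: 2}, None]
}
# Randomized search: 20 sampled configurations, 5-fold CV, scored by precision for class 1
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=20,
                                   cv=5,
                                   scoring='precision',
                                   n_jobs=-1,
                                   verbose=2,
                                   random_state=42)
random_search.fit(x_train, y_train)
# best_estimator_ is refit on the full training set with the best parameters
best_model = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)

# Predict on the held-out test set
y_pred = best_model.predict(x_test)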

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'class_weight': {0: 2, 1: 1}, 'max_depth': 31, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 443}
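
The search above uses `cv=5`, which for a classifier means stratified folds that ignore time order. Because the train/test split is chronological, one possible variation is to make the cross-validation folds chronological too with scikit-learn's `TimeSeriesSplit`. The sketch below is not the notebook's method: it uses synthetic data, a much smaller search space so it runs quickly, and it assumes the rows are already sorted by date.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic, time-ordered stand-in for x_train / y_train (not real stock data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Each fold trains on earlier rows and validates on the rows that follow
tscv = TimeSeriesSplit(n_splits=3)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(20, 60),  # kept small for a fast demo
        "max_depth": randint(2, 8),
    },
    n_iter=3,
    cv=tscv,
    scoring="precision",
    random_state=42,
)
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
```

With this splitter, validation rows always come after the rows used for fitting, which mirrors how the model is evaluated on 2023 and later data.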


### 5. Evaluation

In [40]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 0.654434250764526
Precision: 0.4224137931034483
Confusion Matrix:
[[1021   67]
 [ 498   49]]
Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.94      0.78      1088
           1       0.42      0.09      0.15       547

    accuracy                           0.65      1635
   macro avg       0.55      0.51      0.47      1635
weighted avg       0.59      0.65      0.57      1635
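
The headline figures can be rechecked directly from the confusion matrix printed above. The sketch below is plain arithmetic on the four counts; the names `base_rate` and `majority_accuracy` are introduced here for illustration and do not appear in the notebook.

```python
# Counts from the confusion matrix above: rows are true labels, columns are predictions
tn, fp = 1021, 67
fn, tp = 498, 49
total = tn + fp + fn + tp

precision = tp / (tp + fp)               # share of predicted 'Buy' that were correct
recall = tp / (tp + fn)                  # share of actual 'Buy' that were found
accuracy = (tp + tn) / total
base_rate = (tp + fn) / total            # share of 'Buy' labels in the test set
majority_accuracy = (tn + fp) / total    # accuracy of always predicting class 0

print(f"precision={precision:.4f}")                   # precision=0.4224
print(f"recall={recall:.4f}")                         # recall=0.0896
print(f"accuracy={accuracy:.4f}")                     # accuracy=0.6544
print(f"base_rate={base_rate:.4f}")                   # base_rate=0.3346
print(f"majority_accuracy={majority_accuracy:.4f}")   # majority_accuracy=0.6654
```

Precision is the more informative number here: it sits above the 33.5% base rate, while accuracy is slightly below what always predicting class 0 would achieve.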



In [41]:
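# Attach predictions to the test rows
test_results = test_data.copy()
test_results['Predicted_Buy'] = y_pred

# Average return of the stocks the model flagged as 'Buy'
predicted_stocks_to_buy = test_results[test_results['Predicted_Buy'] == 1]
predicted_stocks_avg_return = predicted_stocks_to_buy['Result'].mean()

# Benchmark: average return across the whole test sample
avg_stock_return = test_results['Result'].mean()

# Benchmark: average return of the stocks actually labeled 'Buy'
best_stocks = test_results[test_results['Buy'] == 1]
avg_best_stocks_return = best_stocks['Result'].mean()

# Benchmark: S&P 500 return averaged over the rows the model flagged,
# so it covers the same periods as the model's picks
sp500_return = predicted_stocks_to_buy['Benchmark SP500 Performance'].mean()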

print("Benchmarks: ")
print(f"Average stock return (whole test sample): {avg_stock_return:.5f}")
print(f"'Buy' stocks average return: {avg_best_stocks_return:.5f}")
print(f"SP500 return: {sp500_return:.5f}")

print(f"\nModel's predicted stock average return: {predicted_stocks_avg_return:.5f}")

Benchmarks: 
Average stock return (whole test sample): 1.03164
'Buy' stocks average return: 1.17566
SP500 return: 1.05889

Model's predicted stock average return: 1.05902
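
To put the printed figures on a common scale, the gaps can be expressed in percentage points. This sketch assumes the values are gross return multiples (1.0 = break-even), which is an interpretation of the output rather than something the notebook states; `excess_pct_points` is a hypothetical helper.

```python
# Figures copied from the output above; assumed to be gross return multiples
model_return = 1.05902
sp500_return = 1.05889
avg_stock_return = 1.03164

def excess_pct_points(a: float, b: float) -> float:
    """Difference between two gross return multiples, in percentage points."""
    return (a - b) * 100

vs_sp500 = excess_pct_points(model_return, sp500_return)
vs_average = excess_pct_points(model_return, avg_stock_return)

print(f"vs S&P 500: {vs_sp500:.3f} pp")          # vs S&P 500: 0.013 pp
print(f"vs average stock: {vs_average:.3f} pp")  # vs average stock: 2.738 pp
```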


### 6. Conclusion

The model demonstrates some promising aspects, although there is considerable room for improvement. Precision for the positive class (class 1) is 42.24%, compared with a 33.5% share of 'Buy' cases in the test set (547 of 1635), so a stock the model flags is more likely to be a 'Buy' than one picked at random. Recall, however, is only 9%: the model finds 49 of the 547 actual 'Buy' cases. The model's predicted stock average return of 1.05902 is above the whole-sample average of 1.03164 and slightly outperforms the S&P 500 return of 1.05889 (a difference of 0.00013), which suggests the model can generate returns competitive with the benchmark, though the margin is very small.

To build on this potential, recall for the positive class could be improved through more data, feature engineering, or alternative modeling approaches. Refining these areas could make the model a more reliable tool for developing an investment strategy.
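
One low-cost way to explore the precision/recall trade-off mentioned above is to vary the decision threshold applied to predicted probabilities instead of using the default 0.5. The sketch below is an illustration on made-up labels and scores, not a result from this model; in the notebook the scores could come from `best_model.predict_proba(x_test)[:, 1]`, and a threshold would need to be chosen on validation data rather than on the test set.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Precision and recall for class 1 when predicting 1 wherever proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    tp = int(np.sum(pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and scores for illustration only
y_demo = [0, 0, 1, 1, 0, 1, 0, 1]
proba_demo = [0.10, 0.40, 0.35, 0.80, 0.55, 0.45, 0.20, 0.60]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(y_demo, proba_demo, threshold)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

In this toy case, lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0 while precision stays at about 0.667; on real data the two usually move in opposite directions.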