<h1 style="text-align: center;">Random Forest classifier for predicting bullish or bearish market trends based on the signals</h1>

#### Loading the Data

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load dataset
data = pd.read_excel('SO_Stock_Analysis.xlsx')

# Display the first few rows of the dataframe to understand its structure
print(data.head())


        Date  Price   Open   High    Low   Vol.  Change %   SMA20    SMA50  \
0 2024-01-02  70.85  69.67  70.96  69.60  3.96M      1.04  70.229  68.8206   
1 2024-01-03  72.24  71.00  72.30  70.99  6.29M      1.96  70.229  68.8206   
2 2024-01-04  71.71  72.31  72.54  71.58  3.44M     -0.73  70.229  68.8206   
3 2024-01-05  71.61  71.74  71.85  70.71  5.67M     -0.14  70.229  68.8206   
4 2024-01-08  72.16  71.50  72.18  71.18  3.30M      0.77  70.229  68.8206   

    SMA200 Strong_Buy  Buy_Signal Strong_Sell  Sell_Signal  RSI_manual  \
0  77.6633        NaN         NaN         NaN          NaN         NaN   
1  77.6633        NaN         NaN         NaN          NaN         NaN   
2  77.6633        NaN         NaN         NaN          NaN         NaN   
3  77.6633        NaN         NaN         NaN          NaN         NaN   
4  77.6633        NaN         NaN         NaN          NaN         NaN   

       EMA12      EMA26      MACD  MACD_signal  MACD_hist  
0  70.850000  70.850000  0

In [17]:
# Check for NaN values in each column
nan_counts = data.isna().sum()
print(nan_counts)

Date             0
Price            0
Open             0
High             0
Low              0
Vol.             0
Change %         0
SMA20            0
SMA50            0
SMA200           0
Strong_Buy     245
Buy_Signal     285
Strong_Sell    276
Sell_Signal    285
RSI_manual      13
EMA12            0
EMA26            0
MACD             0
MACD_signal      0
MACD_hist        0
Target           0
dtype: int64


In [18]:
from sklearn.impute import SimpleImputer

# Convert signal columns to numerical for easier processing (e.g., 1 for signal, 0 for no signal)
signal_columns = ['Strong_Buy', 'Buy_Signal', 'Strong_Sell', 'Sell_Signal']
data[signal_columns] = data[signal_columns].notna().astype(int)

# For RSI_manual, impute missing values with the median or another statistic
# It's crucial to handle this carefully as RSI is a critical indicator
imputer = SimpleImputer(strategy='median')
data['RSI_manual'] = imputer.fit_transform(data[['RSI_manual']])

# Check results
print(data[signal_columns].head())
print(data['RSI_manual'].head())

   Strong_Buy  Buy_Signal  Strong_Sell  Sell_Signal
0           0           0            0            0
1           0           0            0            0
2           0           0            0            0
3           0           0            0            0
4           0           0            0            0
0    53.763297
1    53.763297
2    53.763297
3    53.763297
4    53.763297
Name: RSI_manual, dtype: float64


#### Preprocessing Data

In [19]:
import numpy as np
from sklearn.impute import SimpleImputer

# Preprocess Volume: Convert 'M' to millions
data['Vol.'] = data['Vol.'].replace({'M': '*1e6', 'K': '*1e3'}, regex=True).map(pd.eval).astype(float)

# We create a single column 'Market_Trend' to use as the label
conditions = [
    (data['Strong_Buy'] == 'Strong Buy') | (data['Buy_Signal'] == 'Buy/Hold'),
    (data['Strong_Sell'] == 'Strong Sell') | (data['Sell_Signal'] == 'Sell')
]
choices = ['Bullish', 'Bearish']
data['Target'] = np.select(conditions, choices, default='Neutral')

# Encode the categorical data
data['Target'] = data['Target'].astype('category').cat.codes

# Select features
features = data[['Price', 'Open', 'High', 'Low', 'Vol.', 'SMA20', 'SMA50', 'SMA200', 'EMA12', 'EMA26', 'MACD', 'MACD_signal', 'MACD_hist', 'RSI_manual']]
target = data['Target']

# Handle missing values
imputer = SimpleImputer(strategy='median')
features_imputed = imputer.fit_transform(features)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


#### Training the Random Forest Model

In [20]:
# Initialize and train the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#### Make predictions 

In [21]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        57

    accuracy                           1.00        57
   macro avg       1.00      1.00      1.00        57
weighted avg       1.00      1.00      1.00        57

Accuracy: 1.0


Perfect accuracy often suggests that the model might be overfitting the training data, especially when all metrics like precision, recall, and F1-score are also 1.0. It might be overly tuned to the nuances of the training data, failing to generalize to unseen data.

Predicting stock movements involves a multitude of factors that aren't always captured by historical data alone.

Market sentiment, economic indicators like GDP and inflation, geopolitical developments, and specific sector dynamics play critical roles. Government policies and regulations can also significantly impact market behavior. These predictions can be provide some insights, they need to be used with caution and in the context of broader market analysis. 