## ML-Based Modelling

Training a machine learning model to predict entry and exit points for trading is a complex task that involves multiple steps. Here's a high-level overview of the process:

1. **Data Collection and Preprocessing:**
   - Collect historical price data for the asset you want to trade, along with any inputs derived from it, such as volatility and RSI.
   - Preprocess the data by handling missing values, normalizing or standardizing the features, and splitting it into training and testing sets.

2. **Feature Engineering:**
   - Create relevant features that the model can use for making predictions. This could include lagged price data, moving averages, technical indicators, and other factors that you believe are important for decision-making.

3. **Label Generation:**
   - Define the labels for the training dataset. For a binary classification problem, labels can be generated from the price movement over a fixed horizon. For example, if the price has risen by at least a chosen threshold a set number of periods later, label the row as a buy signal (1); otherwise label it 0.

4. **Model Selection:**
   - Choose a suitable machine learning algorithm for your problem. You could start with algorithms like Random Forest, Gradient Boosting, Support Vector Machines, or Neural Networks.

5. **Training the Model:**
   - Train the model on the training dataset using the chosen algorithm. Use a combination of features (volatility, RSI, etc.) to predict the labels (buy/sell).

6. **Hyperparameter Tuning:**
   - Optimize the hyperparameters of your model to improve its performance. This may involve techniques like grid search or random search.

7. **Validation and Testing:**
   - Evaluate the trained model on the testing dataset to assess its performance. Use metrics like accuracy, precision, recall, and F1-score to measure its effectiveness.

8. **Backtesting:**
   - Implement the trading strategy using the model's predictions. Simulate the trades on historical data to see how the model would have performed in the past.

9. **Fine-Tuning and Iteration:**
   - Analyze the results of backtesting and refine the model if necessary. Adjust parameters, features, or even try different algorithms.

10. **Forward Testing and Deployment:**
    - Forward test the model on more recent data to ensure its robustness. If satisfied, deploy the model to a live trading environment with proper risk management.

11. **Continuous Monitoring and Updating:**
    - Continuously monitor the model's performance and update it periodically to adapt to changing market conditions.
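
Step 8 can be made concrete with a small sketch. The function below is an illustration rather than this notebook's method: `backtest_signals` is a hypothetical helper that assumes each buy signal is entered at that day's close, held for a fixed number of days, and never overlaps with another trade. It ignores transaction costs and slippage.

```python
import numpy as np

def backtest_signals(close, signals, hold_period):
    """Return per-trade returns for a fixed-holding-period, long-only strategy.

    Assumptions (illustrative only): enter at the close on a signal day,
    exit at the close `hold_period` days later, one position at a time,
    no transaction costs or slippage.
    """
    close = np.asarray(close, dtype=float)
    signals = np.asarray(signals, dtype=int)
    trade_returns = []
    t = 0
    while t + hold_period < len(close):
        if signals[t] == 1:
            entry, exit_price = close[t], close[t + hold_period]
            trade_returns.append((exit_price - entry) / entry)
            t += hold_period  # skip ahead so positions never overlap
        else:
            t += 1
    return np.array(trade_returns)

# Toy prices and signals, not market data
close = [100.0, 102.0, 104.0, 103.0, 101.0, 105.0, 110.0]
signals = [1, 0, 0, 1, 0, 0, 0]
returns = backtest_signals(close, signals, hold_period=2)
total_return = float(np.prod(1.0 + returns) - 1.0)  # compounded over all trades
print(returns, total_return)
```

In practice the `signals` array would come from `model.predict` on data the model has not seen during training.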

Please note that developing a successful trading strategy using machine learning is challenging and requires expertise in both trading and data science. Additionally, machine learning models can be prone to overfitting and may not always generalize well to new data. Therefore, it's important to approach this task with caution and consider seeking guidance from experts in both fields.
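
One common guard against the overfitting mentioned above is to validate only on data that lies strictly after the training data. The sketch below hand-rolls expanding-window splits in plain Python. `walk_forward_splits` and its parameters are hypothetical names chosen for illustration; scikit-learn's `TimeSeriesSplit` provides a similar ready-made splitter.

```python
def walk_forward_splits(n_samples, n_splits, gap=0):
    """Yield (train_indices, test_indices) with each test block after its train block.

    `gap` drops that many samples between train and test, which can help when
    labels look `gap` steps into the future (an assumption of this sketch).
    """
    test_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start = k * test_size
        # The last fold absorbs any leftover samples
        test_end = test_start + test_size if k < n_splits else n_samples
        train_end = test_start - gap
        if train_end <= 0:
            continue  # not enough history for this fold
        yield list(range(0, train_end)), list(range(test_start, test_end))

folds = list(walk_forward_splits(n_samples=12, n_splits=3, gap=1))
for train_idx, test_idx in folds:
    print("train up to", train_idx[-1], "| test", test_idx[0], "to", test_idx[-1])
```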

In [24]:
import yfinance as yf
import pandas as pd
import numpy as np
import talib
from sklearn.model_selection import train_test_split

# Step 1: Data Collection and Preprocessing

def fetch_stock_data(symbol, start_date, end_date):
    data = yf.download(symbol, start=start_date, end=end_date)
    return data

def preprocess_data(data):
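    # Day-over-day change in the closing price
    data['CloseDiff'] = data['Close'].diff()
    # Forward-fill gaps; fillna(method='ffill') is deprecated in recent pandas
    data.ffill(inplace=True)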
    return data

# Step 2: Feature Generation

def generate_technical_indicators(data):
    data['SMA'] = talib.SMA(data['Close'], timeperiod=14)
    data['RSI'] = talib.RSI(data['Close'], timeperiod=14)
    upper_band, middle_band, lower_band = talib.BBANDS(data['Close'], timeperiod=20)
    data['BB_upper'] = upper_band
    data['BB_lower'] = lower_band
    # Add more technical indicators from talib
    return data

# Step 3: Label Generation

def generate_labels(data, period, threshold):
    data['FutureClose'] = data['Close'].shift(-period)
    data['PriceChange'] = (data['FutureClose'] - data['Close']) / data['Close']
    data['Label'] = np.where(data['PriceChange'] >= threshold, 1, 0)
    return data

if __name__ == '__main__':
    symbol = '^NSEI'  # Replace with your desired stock symbol
    start_date = '2016-01-01'
    end_date = '2022-01-01'
    period = 5  # Number of days into the future
    threshold = 0.02  # Price change threshold

    # Fetch and preprocess data
    stock_data = fetch_stock_data(symbol, start_date, end_date)
    preprocessed_data = preprocess_data(stock_data)

    # Generate technical indicators
    data_with_indicators = generate_technical_indicators(preprocessed_data)

    # Generate labels
    labeled_data = generate_labels(data_with_indicators, period, threshold)

    print(labeled_data.head())


[*********************100%***********************]  1 of 1 completed
                   Open         High          Low        Close    Adj Close  \
Date                                                                          
2016-01-04  7924.549805  7937.549805  7781.100098  7791.299805  7791.299805   
2016-01-05  7828.399902  7831.200195  7763.250000  7784.649902  7784.649902   
2016-01-06  7788.049805  7800.950195  7721.200195  7741.000000  7741.000000   
2016-01-07  7673.350098  7674.950195  7556.600098  7568.299805  7568.299805   
2016-01-08  7611.649902  7634.100098  7581.049805  7601.350098  7601.350098   

            Volume   CloseDiff  SMA  RSI  BB_upper  BB_lower  FutureClose  \
Date                                                                        
2016-01-04  134700         NaN  NaN  NaN       NaN       NaN  7563.850098   
2016-01-05  145200   -6.649902  NaN  NaN       NaN       NaN  7510.299805   
2016-01-06  147100  -43.649902  NaN  NaN       NaN       NaN  7562.39
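
To see what `generate_labels` does, including at the end of the series where no future price exists yet, here is a small worked example on made-up prices. The toy values and the explicit `dropna(subset=...)` step are illustrative assumptions; the notebook itself removes incomplete rows later with a plain `dropna()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Close": [100.0, 101.0, 103.0, 102.0, 106.0]})
period, threshold = 2, 0.02

toy["FutureClose"] = toy["Close"].shift(-period)
toy["PriceChange"] = (toy["FutureClose"] - toy["Close"]) / toy["Close"]
# A NaN comparison is False, so the last `period` rows get label 0
# even though their future price is unknown.
toy["Label"] = np.where(toy["PriceChange"] >= threshold, 1, 0)

# Keep only the rows whose outcome is actually known.
known = toy.dropna(subset=["FutureClose"])
print(known[["Close", "FutureClose", "Label"]])
```

Dropping the trailing rows matters because their 0 labels do not reflect an observed outcome.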

In [25]:
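from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)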


def train_random_forest(X_train, y_train):
    rf_model = RandomForestClassifier(random_state=42)
    rf_model.fit(X_train, y_train)
    return rf_model

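def train_test_split_data(data, features, label):
    X = data[features]
    y = data[label]
    # Note: train_test_split shuffles rows by default, so later dates can end up
    # in the training set; pass shuffle=False for a chronological split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

def grid_search_model(model, param_grid, X_train, y_train):
    # 3-fold cross-validated search; returns the best estimator refit on X_train
    grid_search = GridSearchCV(model, param_grid, cv=3)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_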

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_true = y_test
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred)
    
    print("Confusion Matrix:")
    print(confusion_matrix(y_true, y_pred))
    
    print("Classification Report:")
    print(classification_report(y_true, y_pred))
    
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1-Score: {f1:.2f}")
    print(f"ROC AUC: {roc_auc:.2f}")




# This code completes steps 4 to 7 of the process.
# It trains a Random Forest model, performs a grid search for hyperparameter tuning,
# and evaluates the model's accuracy.
# You can further customize the hyperparameter grid and
# evaluation metrics based on your requirements.
# Additionally, you can later explore other machine learning algorithms like
# Gradient Boosting, Support Vector Machines, and Neural Networks for comparison.

In [26]:
# Select features and label for training
features = ['SMA', 'RSI', 'BB_upper', 'BB_lower']  # Add more features here
label = 'Label'

labeled_data = labeled_data.dropna()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split_data(labeled_data, features, label)

# Train a Random Forest model
rf_model = train_random_forest(X_train, y_train)

# Define hyperparameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search for Random Forest
best_rf_model = grid_search_model(rf_model, rf_param_grid, X_train, y_train)

# Evaluate the best model
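# evaluate_model() only prints its metrics and returns None,
# so take the accuracy from the estimator's score() method instead.
rf_accuracy = best_rf_model.score(X_test, y_test)
print(f"Random Forest Accuracy: {rf_accuracy:.2f}")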

Random Forest Accuracy: 0.80
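
Grid search cost grows multiplicatively with each hyperparameter. As a quick check, the sketch below counts the candidates in a grid with the same values as `rf_param_grid`. With `cv=3`, `GridSearchCV` fits every candidate three times and then, by default (`refit=True`), refits the best one on the full training set. The variable names here are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 3

# Every combination of one value per hyperparameter is a candidate
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds + 1  # +1 for the final refit
print(len(candidates), "candidates,", n_fits, "model fits")
```

If that is too slow, a random search over the same ranges samples only a fixed number of candidates.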


In [30]:
evaluate_model(best_rf_model, X_test, y_test)

Confusion Matrix:
[[217  18]
 [ 39  17]]
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.88       235
           1       0.49      0.30      0.37        56

    accuracy                           0.80       291
   macro avg       0.67      0.61      0.63       291
weighted avg       0.78      0.80      0.79       291

Accuracy: 0.80
Precision: 0.49
Recall: 0.30
F1-Score: 0.37
ROC AUC: 0.61
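
The headline accuracy needs a reference point. The sketch below recomputes the reported metrics from the confusion matrix above and compares the accuracy with a baseline that always predicts class 0. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 217, 18
fn, tp = 39, 17

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
# With hard 0/1 predictions, ROC AUC reduces to the mean of recall and specificity
balanced_accuracy = (recall + specificity) / 2

# Accuracy of a model that always predicts the majority class (0)
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.3f} baseline={majority_baseline:.3f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} balanced={balanced_accuracy:.2f}")
```

On this split the model's accuracy (0.804) sits slightly below the always-0 baseline (0.808), so the precision and recall for class 1 say more about the model than accuracy does. Passing `predict_proba` scores rather than hard predictions to `roc_auc_score` would give a threshold-independent AUC.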


In [27]:
labeled_data[labeled_data.Label==1].count()

Open           264
High           264
Low            264
Close          264
Adj Close      264
Volume         264
CloseDiff      264
SMA            264
RSI            264
BB_upper       264
BB_lower       264
FutureClose    264
PriceChange    264
Label          264
dtype: int64

In [28]:
labeled_data[labeled_data.Label==0].count()

Open           1188
High           1188
Low            1188
Close          1188
Adj Close      1188
Volume         1188
CloseDiff      1188
SMA            1188
RSI            1188
BB_upper       1188
BB_lower       1188
FutureClose    1188
PriceChange    1188
Label          1188
dtype: int64
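
The two counts show a clear imbalance between the labels. As a sketch, the snippet below computes the positive rate and the weights that the "balanced" heuristic, `n_samples / (n_classes * class_count)`, would assign to each class. Passing `class_weight='balanced'` to `RandomForestClassifier` applies this heuristic; whether it improves results on this dataset is untested here.

```python
counts = {0: 1188, 1: 264}  # label counts from the two cells above

n_samples = sum(counts.values())
n_classes = len(counts)
positive_rate = counts[1] / n_samples
# Rarer classes receive proportionally larger weights
balanced_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(f"positive rate: {positive_rate:.3f}")
print(balanced_weights)
```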