<a href="https://colab.research.google.com/github/Sathyavrv/Nifty/blob/main/Nifty.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stock Market Prediction using LightGBM on NIFTY Data

## Overview

This project demonstrates an approach to predicting stock market movements using NIFTY index data. It combines daily price data from Yahoo Finance with minute-level trading data to train a LightGBM model. The goal is to classify future price movements as a "buy," "sell," or "wait" signal.

The workflow involves data preprocessing, feature engineering, model training, and evaluation. The model's effectiveness is measured through metrics such as multi-class log loss, accuracy, and classification report. Additionally, feature importance is analyzed to identify the most influential factors driving the model's predictions.

## Key Features

- **Data Preprocessing:** Combining and cleaning daily data from Yahoo Finance with 1-minute and 5-minute intraday data, calculating Fibonacci levels, and generating additional features based on price differences, volume ratios, and more.
- **Target Determination:** The target labels (buy, sell, wait) are derived from intraday price movements: a later bar on the same day moving more than 50 points above or below the current bar's open sets the signal.
- **LightGBM Training:** The model is trained with 5-fold shuffled K-Fold cross-validation, optimizing multi-class log loss and selecting the best iteration through early stopping.
- **Feature Importance Analysis:** Post-training, the top features contributing to model predictions are identified, providing insights into the key drivers of stock price movements.

## Results

The LightGBM model reached an out-of-fold accuracy of about 95% (0.9513) across the five cross-validation folds, with detailed classification metrics provided for each class. The top 5 features by aggregated importance are:

1. **Fib_1.5_Low_1_High_2:** Aggregated Feature Importance: ~170,000
2. **Volume_1:** Aggregated Feature Importance: ~160,000
3. **Fib_1.618_Low_1_High_2:** Aggregated Feature Importance: ~140,000
4. **Volume_Sum:** Aggregated Feature Importance: ~130,000
5. **Volume_Difference:** Aggregated Feature Importance: ~120,000

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score, classification_report
import lightgbm as lgb
from tqdm import tqdm

# Load data

In [None]:
import yfinance as yf
yf.pdr_override() # <== that's all it takes :-)
from pandas_datareader import data as pdr

nifty_day = pdr.get_data_yahoo('^NSEI', start='2015-01-07',end='2022-10-21')

yfinance: pandas_datareader support is deprecated & semi-broken so will be removed in a future verison. Just use yfinance.


[*********************100%%**********************]  1 of 1 completed


In [None]:
nifty_day

Date,Open,High,Low,Close,Adj Close,Volume
2015-01-07,8118.649902,8151.200195,8065.450195,8102.100098,8102.100098,164100
2015-01-08,8191.399902,8243.500000,8167.299805,8234.599609,8234.599609,143800
2015-01-09,8285.450195,8303.299805,8190.799805,8284.500000,8284.500000,148000
2015-01-12,8291.349609,8332.599609,8245.599609,8323.000000,8323.000000,103200
2015-01-13,8346.150391,8356.650391,8267.900391,8299.400391,8299.400391,129600
...,...,...,...,...,...,...
2022-10-14,17322.300781,17348.550781,17169.750000,17185.699219,17185.699219,227000
2022-10-17,17144.800781,17328.550781,17098.550781,17311.800781,17311.800781,212200
2022-10-18,17438.750000,17527.800781,17434.050781,17486.949219,17486.949219,239500
2022-10-19,17568.150391,17607.599609,17472.849609,17512.250000,17512.250000,210500


In [None]:
nifty_5min = pd.read_csv("/content/Nifty/src/NIFTY_5min_jan_2015_to_oct_2022.csv")

In [None]:
nifty_5min

In [None]:
nifty_min = pd.read_csv("/content/Nifty/src/NIFTY 50 - Minute data.csv")

In [None]:
nifty_min

# Preprocess

In [None]:
# Function to clean data and add new columns
def clean_data(minute_data, five_min_data):
    # Standardize date format and remove timezone and volume column
    minute_data['date'] = pd.to_datetime(minute_data['date']).dt.tz_localize(None)
    minute_data = minute_data.drop(columns=['volume'])

    five_min_data['date'] = pd.to_datetime(five_min_data['date']).dt.tz_localize(None)
    five_min_data = five_min_data.drop(columns=['volume'])

    # Create a DataFrame for mapping
    mapping = []
    for i in range(1, 5):
        offset = pd.Timedelta(minutes=5-i)
        temp = minute_data.copy()
        temp['five_min_date'] = temp['date'] + offset
        temp = temp[['five_min_date', 'open']]
        temp.columns = ['date', f'{i}']
        mapping.append(temp)

    # Merge the mapping DataFrames with the 5-minute data
    for temp in mapping:
        five_min_data = five_min_data.merge(temp, on='date', how='left')

    # Split the timestamp into calendar date, month, day, year, hour, and minute
    five_min_data['Date'] = five_min_data['date'].dt.date
    five_min_data['Month'] = five_min_data['date'].dt.month
    five_min_data['Day'] = five_min_data['date'].dt.day
    five_min_data['Year'] = five_min_data['date'].dt.year
    five_min_data['Hour'] = five_min_data['date'].dt.hour
    five_min_data['Minute'] = five_min_data['date'].dt.minute
    five_min_data = five_min_data.drop(columns=['date'])

    # Remove the first row from five_min_data
    five_min_data = five_min_data.iloc[1:].reset_index(drop=True)

    return five_min_data

# Apply the function to clean the data
clean_nifty_5min = clean_data(nifty_min, nifty_5min)
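To make the offset merge in `clean_data` easier to follow, here is a minimal sketch on hypothetical toy data (the timestamps and prices are invented for illustration). It shows that columns `1` to `4` end up holding the opens of the four 1-minute bars that precede each 5-minute timestamp.

```python
import pandas as pd

# Hypothetical toy data: ten 1-minute opens and one 5-minute bar stamped 09:20.
minute = pd.DataFrame({
    "date": pd.date_range("2015-01-09 09:15", periods=10, freq="min"),
    "open": [100.0 + k for k in range(10)],
})
five = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-09 09:20"]),
    "open": [105.0],
})

# Same idea as clean_data: shift each minute timestamp forward by (5 - i)
# minutes so that it lines up with the 5-minute bar it precedes.
for i in range(1, 5):
    shifted = minute.assign(date=minute["date"] + pd.Timedelta(minutes=5 - i))
    shifted = shifted.rename(columns={"open": str(i)})[["date", str(i)]]
    five = five.merge(shifted, on="date", how="left")

print(five)
```

Column `1` comes from the minute bar four minutes before the 5-minute timestamp and column `4` from the bar one minute before it.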


In [None]:
clean_nifty_5min

Unnamed: 0,close,high,low,open,1,2,3,4,Date,Month,Day,Year,Hour,Minute
0,8301.00,8303.00,8293.25,8300.50,8292.60,8287.40,8294.25,8300.60,2015-01-09,1,9,2015,9,20
1,8294.15,8302.55,8286.80,8301.65,8300.65,8302.45,8294.85,8295.20,2015-01-09,1,9,2015,9,25
2,8288.50,8295.75,8280.65,8294.10,8295.40,8289.65,8292.30,8290.65,2015-01-09,1,9,2015,9,30
3,8283.45,8290.45,8278.00,8289.10,8289.40,8289.55,8282.75,8283.45,2015-01-09,1,9,2015,9,35
4,8285.55,8288.30,8277.40,8283.40,8284.75,8284.95,8278.95,8282.30,2015-01-09,1,9,2015,9,40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136196,17577.60,17577.95,17562.35,17562.35,17567.45,17563.05,17555.40,17560.35,2022-10-21,10,21,2022,15,5
136197,17571.00,17580.95,17570.10,17578.00,17570.90,17568.75,17574.00,17573.75,2022-10-21,10,21,2022,15,10
136198,17579.45,17581.00,17570.75,17571.35,17577.65,17573.65,17580.35,17576.00,2022-10-21,10,21,2022,15,15
136199,17595.20,17595.20,17576.75,17579.40,17577.35,17574.40,17577.90,17577.35,2022-10-21,10,21,2022,15,20


In [None]:
# Function to add columns from nifty_day to clean_five_min_data
def add_day_columns(clean_five_min_data, nifty_day):
    clean_five_min_data['High_1'] = None
    clean_five_min_data['Low_1'] = None
    clean_five_min_data['Volume_1'] = None
    clean_five_min_data['High_2'] = None
    clean_five_min_data['Low_2'] = None
    clean_five_min_data['Volume_2'] = None

    for idx, row in clean_five_min_data.iterrows():
        date = row['Date']
        prev_dates = nifty_day.loc[:date].tail(3).index[:-1]  # Get previous 2 dates
        if len(prev_dates) == 2:
            clean_five_min_data.at[idx, 'High_1'] = nifty_day.at[prev_dates[-1], 'High']
            clean_five_min_data.at[idx, 'Low_1'] = nifty_day.at[prev_dates[-1], 'Low']
            clean_five_min_data.at[idx, 'Volume_1'] = nifty_day.at[prev_dates[-1], 'Volume']
            clean_five_min_data.at[idx, 'High_2'] = nifty_day.at[prev_dates[-2], 'High']
            clean_five_min_data.at[idx, 'Low_2'] = nifty_day.at[prev_dates[-2], 'Low']
            clean_five_min_data.at[idx, 'Volume_2'] = nifty_day.at[prev_dates[-2], 'Volume']

    return clean_five_min_data

# Apply the function to add day columns
clean_nifty_5min = add_day_columns(clean_nifty_5min, nifty_day)
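The row-by-row loop above can be slow on a frame of this size. A possible vectorized alternative, sketched here on hypothetical toy data, shifts the daily frame by one and two trading days and joins it on `Date`. The names `daily`, `prev`, and `intraday` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical daily frame indexed by trading date (values are made up).
daily = pd.DataFrame(
    {
        "High": [10.0, 11.0, 12.0, 13.0],
        "Low": [9.0, 9.5, 10.0, 11.0],
        "Volume": [100, 200, 300, 400],
    },
    index=pd.to_datetime(["2015-01-07", "2015-01-08", "2015-01-09", "2015-01-12"]),
)

# shift(k) moves values k trading days forward, so the row for day t
# carries the values of day t-k.
prev = pd.concat(
    [daily.shift(1).add_suffix("_1"), daily.shift(2).add_suffix("_2")], axis=1
)
prev["Date"] = prev.index.date  # same key type as the 'Date' column from clean_data

# Hypothetical intraday rows; only the 'Date' key matters for the join.
intraday = pd.DataFrame({
    "Date": [pd.Timestamp("2015-01-09").date()] * 2
            + [pd.Timestamp("2015-01-12").date()],
    "open": [12.1, 12.2, 13.1],
})
merged = intraday.merge(prev, on="Date", how="left")
print(merged)
```

One behavioural difference to keep in mind: if an intraday date is missing from the daily data, the loop falls back to earlier daily rows, whereas this join leaves the new columns as NaN for that date. The join also produces numeric columns rather than object-typed ones.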


In [None]:
clean_nifty_5min

Unnamed: 0,close,high,low,open,1,2,3,4,Date,Month,Day,Year,Hour,Minute,High_1,Low_1,Volume_1,High_2,Low_2,Volume_2
0,8301.00,8303.00,8293.25,8300.50,8292.60,8287.40,8294.25,8300.60,2015-01-09,1,9,2015,9,20,8243.5,8167.299805,143800,8151.200195,8065.450195,164100
1,8294.15,8302.55,8286.80,8301.65,8300.65,8302.45,8294.85,8295.20,2015-01-09,1,9,2015,9,25,8243.5,8167.299805,143800,8151.200195,8065.450195,164100
2,8288.50,8295.75,8280.65,8294.10,8295.40,8289.65,8292.30,8290.65,2015-01-09,1,9,2015,9,30,8243.5,8167.299805,143800,8151.200195,8065.450195,164100
3,8283.45,8290.45,8278.00,8289.10,8289.40,8289.55,8282.75,8283.45,2015-01-09,1,9,2015,9,35,8243.5,8167.299805,143800,8151.200195,8065.450195,164100
4,8285.55,8288.30,8277.40,8283.40,8284.75,8284.95,8278.95,8282.30,2015-01-09,1,9,2015,9,40,8243.5,8167.299805,143800,8151.200195,8065.450195,164100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136196,17577.60,17577.95,17562.35,17562.35,17567.45,17563.05,17555.40,17560.35,2022-10-21,10,21,2022,15,5,17607.599609,17472.849609,210500,17527.800781,17434.050781,239500
136197,17571.00,17580.95,17570.10,17578.00,17570.90,17568.75,17574.00,17573.75,2022-10-21,10,21,2022,15,10,17607.599609,17472.849609,210500,17527.800781,17434.050781,239500
136198,17579.45,17581.00,17570.75,17571.35,17577.65,17573.65,17580.35,17576.00,2022-10-21,10,21,2022,15,15,17607.599609,17472.849609,210500,17527.800781,17434.050781,239500
136199,17595.20,17595.20,17576.75,17579.40,17577.35,17574.40,17577.90,17577.35,2022-10-21,10,21,2022,15,20,17607.599609,17472.849609,210500,17527.800781,17434.050781,239500


In [None]:
# Function to calculate Fibonacci levels
def calculate_fibonacci_levels(df):
    fib_ratios = [0.5, 0.618, 1.5, 1.618]
    for ratio in fib_ratios:
        df[f'Fib_{ratio}_H1_L1'] = df['High_1'] - (df['High_1'] - df['Low_1']) * ratio
        df[f'Fib_{ratio}_H1_L2'] = df['High_1'] - (df['High_1'] - df['Low_2']) * ratio
        df[f'Fib_{ratio}_H2_L1'] = df['High_2'] - (df['High_2'] - df['Low_1']) * ratio
        df[f'Fib_{ratio}_H2_L2'] = df['High_2'] - (df['High_2'] - df['Low_2']) * ratio

    return df

# Apply the function to calculate Fibonacci levels
clean_nifty_5min = calculate_fibonacci_levels(clean_nifty_5min)
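As a quick check of the formula, the sketch below recomputes the `H1_L1` levels by hand for the first rows, using the `High_1` and `Low_1` values shown in the earlier output (8243.5 and 8167.299805). The `levels` dictionary is only an illustration.

```python
# Previous-day high and low for the 2015-01-09 rows, copied from the output above.
high_1 = 8243.5
low_1 = 8167.299805

levels = {}
for ratio in [0.5, 0.618, 1.5, 1.618]:
    # Same expression as calculate_fibonacci_levels, applied to scalars.
    levels[ratio] = high_1 - (high_1 - low_1) * ratio
    print(f"Fib_{ratio}_H1_L1 = {levels[ratio]:.4f}")
```

Ratios below 1 place the level inside the previous day's range, while 1.5 and 1.618 place it below `Low_1`, which is why the `Fib_1.5_*` and `Fib_1.618_*` columns sit under the corresponding lows.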

In [None]:
clean_nifty_5min

Unnamed: 0,close,high,low,open,1,2,3,4,Date,Month,...,Fib_0.618_H2_L1,Fib_0.618_H2_L2,Fib_1.5_H1_L1,Fib_1.5_H1_L2,Fib_1.5_H2_L1,Fib_1.5_H2_L2,Fib_1.618_H1_L1,Fib_1.618_H1_L2,Fib_1.618_H2_L1,Fib_1.618_H2_L2
0,8301.00,8303.00,8293.25,8300.50,8292.60,8287.40,8294.25,8300.60,2015-01-09,1,...,8161.149754,8098.206695,8129.199707,7976.425293,8175.349609,8022.575195,8120.208084,7955.415416,8177.249363,8012.456695
1,8294.15,8302.55,8286.80,8301.65,8300.65,8302.45,8294.85,8295.20,2015-01-09,1,...,8161.149754,8098.206695,8129.199707,7976.425293,8175.349609,8022.575195,8120.208084,7955.415416,8177.249363,8012.456695
2,8288.50,8295.75,8280.65,8294.10,8295.40,8289.65,8292.30,8290.65,2015-01-09,1,...,8161.149754,8098.206695,8129.199707,7976.425293,8175.349609,8022.575195,8120.208084,7955.415416,8177.249363,8012.456695
3,8283.45,8290.45,8278.00,8289.10,8289.40,8289.55,8282.75,8283.45,2015-01-09,1,...,8161.149754,8098.206695,8129.199707,7976.425293,8175.349609,8022.575195,8120.208084,7955.415416,8177.249363,8012.456695
4,8285.55,8288.30,8277.40,8283.40,8284.75,8284.95,8278.95,8282.30,2015-01-09,1,...,8161.149754,8098.206695,8129.199707,7976.425293,8175.349609,8022.575195,8120.208084,7955.415416,8177.249363,8012.456695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136196,17577.60,17577.95,17562.35,17562.35,17567.45,17563.05,17555.40,17560.35,2022-10-21,10,...,17493.840957,17469.863281,17405.474609,17347.276367,17445.374023,17387.175781,17389.574109,17326.797605,17438.889785,17376.113281
136197,17571.00,17580.95,17570.10,17578.00,17570.90,17568.75,17574.00,17573.75,2022-10-21,10,...,17493.840957,17469.863281,17405.474609,17347.276367,17445.374023,17387.175781,17389.574109,17326.797605,17438.889785,17376.113281
136198,17579.45,17581.00,17570.75,17571.35,17577.65,17573.65,17580.35,17576.00,2022-10-21,10,...,17493.840957,17469.863281,17405.474609,17347.276367,17445.374023,17387.175781,17389.574109,17326.797605,17438.889785,17376.113281
136199,17595.20,17595.20,17576.75,17579.40,17577.35,17574.40,17577.90,17577.35,2022-10-21,10,...,17493.840957,17469.863281,17405.474609,17347.276367,17445.374023,17387.175781,17389.574109,17326.797605,17438.889785,17376.113281


In [None]:
# Function to calculate differences
def calculate_differences(df, base_col, columns):
    for col in columns:
        df[f'Diff_{base_col}_{col}'] = abs(df[base_col] - df[col])
    return df

# Columns to calculate differences
columns_to_diff = ['1', '2', '3', '4', 'High_1', 'Low_1', 'High_2', 'Low_2'] + \
                  [f'Fib_{ratio}_H1_L1' for ratio in [0.5, 0.618, 1.5, 1.618]] + \
                  [f'Fib_{ratio}_H1_L2' for ratio in [0.5, 0.618, 1.5, 1.618]] + \
                  [f'Fib_{ratio}_H2_L1' for ratio in [0.5, 0.618, 1.5, 1.618]] + \
                  [f'Fib_{ratio}_H2_L2' for ratio in [0.5, 0.618, 1.5, 1.618]]

# Apply the function to calculate differences
clean_nifty_5min = calculate_differences(clean_nifty_5min, 'open', columns_to_diff)

In [None]:
clean_nifty_5min[:30]

Unnamed: 0,close,high,low,open,1,2,3,4,Date,Month,...,Diff_open_Fib_1.5_H1_L2,Diff_open_Fib_1.618_H1_L2,Diff_open_Fib_0.5_H2_L1,Diff_open_Fib_0.618_H2_L1,Diff_open_Fib_1.5_H2_L1,Diff_open_Fib_1.618_H2_L1,Diff_open_Fib_0.5_H2_L2,Diff_open_Fib_0.618_H2_L2,Diff_open_Fib_1.5_H2_L2,Diff_open_Fib_1.618_H2_L2
0,8301.0,8303.0,8293.25,8300.5,8292.6,8287.4,8294.25,8300.6,2015-01-09,1,...,324.074707,345.084584,141.25,139.350246,125.150391,123.250637,192.174805,202.293305,277.924805,288.043305
1,8294.15,8302.55,8286.8,8301.65,8300.65,8302.45,8294.85,8295.2,2015-01-09,1,...,325.224707,346.234584,142.4,140.500246,126.300391,124.400637,193.324805,203.443305,279.074805,289.193305
2,8288.5,8295.75,8280.65,8294.1,8295.4,8289.65,8292.3,8290.65,2015-01-09,1,...,317.674707,338.684584,134.85,132.950246,118.750391,116.850637,185.774805,195.893305,271.524805,281.643305
3,8283.45,8290.45,8278.0,8289.1,8289.4,8289.55,8282.75,8283.45,2015-01-09,1,...,312.674707,333.684584,129.85,127.950246,113.750391,111.850637,180.774805,190.893305,266.524805,276.643305
4,8285.55,8288.3,8277.4,8283.4,8284.75,8284.95,8278.95,8282.3,2015-01-09,1,...,306.974707,327.984584,124.15,122.250246,108.050391,106.150637,175.074805,185.193305,260.824805,270.943305
5,8283.75,8287.65,8278.05,8285.4,8284.95,8286.1,8280.55,8277.75,2015-01-09,1,...,308.974707,329.984584,126.15,124.250246,110.050391,108.150637,177.074805,187.193305,262.824805,272.943305
6,8276.25,8284.25,8273.95,8283.8,8281.4,8282.8,8279.75,8278.65,2015-01-09,1,...,307.374707,328.384584,124.55,122.650246,108.450391,106.550637,175.474805,185.593305,261.224805,271.343305
7,8282.0,8283.6,8275.05,8275.95,8275.5,8278.95,8279.05,8278.7,2015-01-09,1,...,299.524707,320.534584,116.7,114.800246,100.600391,98.700637,167.624805,177.743305,253.374805,263.493305
8,8285.5,8287.35,8281.7,8281.8,8275.15,8279.45,8283.05,8279.6,2015-01-09,1,...,305.374707,326.384584,122.55,120.650246,106.450391,104.550637,173.474805,183.593305,259.224805,269.343305
9,8280.3,8286.4,8279.95,8285.55,8283.75,8284.6,8286.35,8284.4,2015-01-09,1,...,309.124707,330.134584,126.3,124.400246,110.200391,108.300637,177.224805,187.343305,262.974805,273.093305


In [None]:
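def determine_targets(df):
    # For each bar, scan the later bars of the same day: label 'buy' if a high
    # exceeds the bar's open by more than 50 points first, 'sell' if a low falls
    # more than 50 points below it first, and 'wait' if neither happens.
    df['target'] = 'wait'  # Default value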

    for i in range(len(df)):
        current_open = df.at[i, 'open']
        current_date = df.at[i, 'Date']
        for j in range(i+1, len(df)):
            if df.at[j, 'Date'] != current_date:
                break
            if df.at[j, 'high'] - current_open > 50:
                df.at[i, 'target'] = 'buy'
                break
            if df.at[j, 'low'] - current_open < -50:
                df.at[i, 'target'] = 'sell'
                break

    return df

# Apply the function to determine targets
clean_nifty_5min = determine_targets(clean_nifty_5min)
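The labelling rule can be checked on a small example. The sketch below is a hypothetical reimplementation (`label_first_touch` is not part of the notebook) that groups by day and uses NumPy to find the first later bar that crosses either threshold. The toy prices are invented.

```python
import numpy as np
import pandas as pd

def label_first_touch(df: pd.DataFrame, threshold: float = 50.0) -> pd.Series:
    """Label each bar by which +/- threshold move later bars of the same day hit first."""
    labels = pd.Series("wait", index=df.index, dtype=object)
    for _, day in df.groupby("Date", sort=False):
        opens = day["open"].to_numpy()
        highs = day["high"].to_numpy()
        lows = day["low"].to_numpy()
        idx = day.index.to_numpy()
        for i in range(len(day) - 1):
            up = highs[i + 1:] - opens[i] > threshold
            down = lows[i + 1:] - opens[i] < -threshold
            hit = np.flatnonzero(up | down)
            if hit.size:
                # As in determine_targets, 'buy' wins if one bar crosses both ways.
                labels[idx[i]] = "buy" if up[hit[0]] else "sell"
    return labels

# Hypothetical toy bars: four on day d1, two on day d2.
toy = pd.DataFrame({
    "Date": ["d1"] * 4 + ["d2"] * 2,
    "open": [100.0, 120.0, 160.0, 90.0, 100.0, 200.0],
    "high": [101.0, 125.0, 161.0, 95.0, 101.0, 201.0],
    "low": [99.0, 119.0, 159.0, 89.0, 99.0, 199.0],
})
toy["target"] = label_first_touch(toy)
print(toy)
```

The last bar of each day always stays `wait`, because there are no later bars on that day to compare against.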

In [None]:
clean_nifty_5min

Unnamed: 0,close,high,low,open,1,2,3,4,Date,Month,...,Diff_open_Fib_1.618_H1_L2,Diff_open_Fib_0.5_H2_L1,Diff_open_Fib_0.618_H2_L1,Diff_open_Fib_1.5_H2_L1,Diff_open_Fib_1.618_H2_L1,Diff_open_Fib_0.5_H2_L2,Diff_open_Fib_0.618_H2_L2,Diff_open_Fib_1.5_H2_L2,Diff_open_Fib_1.618_H2_L2,target
0,8301.00,8303.00,8293.25,8300.50,8292.60,8287.40,8294.25,8300.60,2015-01-09,1,...,345.084584,141.25,139.350246,125.150391,123.250637,192.174805,202.293305,277.924805,288.043305,sell
1,8294.15,8302.55,8286.80,8301.65,8300.65,8302.45,8294.85,8295.20,2015-01-09,1,...,346.234584,142.4,140.500246,126.300391,124.400637,193.324805,203.443305,279.074805,289.193305,sell
2,8288.50,8295.75,8280.65,8294.10,8295.40,8289.65,8292.30,8290.65,2015-01-09,1,...,338.684584,134.85,132.950246,118.750391,116.850637,185.774805,195.893305,271.524805,281.643305,sell
3,8283.45,8290.45,8278.00,8289.10,8289.40,8289.55,8282.75,8283.45,2015-01-09,1,...,333.684584,129.85,127.950246,113.750391,111.850637,180.774805,190.893305,266.524805,276.643305,sell
4,8285.55,8288.30,8277.40,8283.40,8284.75,8284.95,8278.95,8282.30,2015-01-09,1,...,327.984584,124.15,122.250246,108.050391,106.150637,175.074805,185.193305,260.824805,270.943305,sell
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136196,17577.60,17577.95,17562.35,17562.35,17567.45,17563.05,17555.40,17560.35,2022-10-21,10,...,235.552395,62.024805,68.509043,116.975977,123.460215,81.424219,92.486719,175.174219,186.236719,wait
136197,17571.00,17580.95,17570.10,17578.00,17570.90,17568.75,17574.00,17573.75,2022-10-21,10,...,251.202395,77.674805,84.159043,132.625977,139.110215,97.074219,108.136719,190.824219,201.886719,wait
136198,17579.45,17581.00,17570.75,17571.35,17577.65,17573.65,17580.35,17576.00,2022-10-21,10,...,244.552395,71.024805,77.509043,125.975977,132.460215,90.424219,101.486719,184.174219,195.236719,wait
136199,17595.20,17595.20,17576.75,17579.40,17577.35,17574.40,17577.90,17577.35,2022-10-21,10,...,252.602395,79.074805,85.559043,134.025977,140.510215,98.474219,109.536719,192.224219,203.286719,wait


In [None]:
def calculate_volume_difference(row):
    """
    Calculate the difference between volume_1 and volume_2 for a row in a DataFrame.

    Parameters:
    row (pd.Series): A row of data from the DataFrame.

    Returns:
    int/float: The difference between volume_1 and volume_2.
    """
    return row['Volume_1'] - row['Volume_2']

def calculate_percentage_change(row):
    if row['Volume_2'] == 0:
        return 0  # To avoid division by zero
    return ((row['Volume_1'] - row['Volume_2']) / row['Volume_2']) * 100

def calculate_volume_ratio(row):
    if row['Volume_2'] == 0:
        return 0  # To avoid division by zero
    return row['Volume_1'] / row['Volume_2']

def calculate_volume_sum(row):
    return row['Volume_1'] + row['Volume_2']

def calculate_high_low_difference(row):
    # The frame has no 'Current_High'/'Current_Low' columns; the current bar's
    # 'high' and 'low' are assumed to be what was meant.
    return row['high'] - row['low']

# Apply the functions to each row in the DataFrame
clean_nifty_5min['Volume_Difference'] = clean_nifty_5min.apply(calculate_volume_difference, axis=1)
clean_nifty_5min['Volume_Percentage_Change'] = clean_nifty_5min.apply(calculate_percentage_change, axis=1)
clean_nifty_5min['Volume_Ratio'] = clean_nifty_5min.apply(calculate_volume_ratio, axis=1)
clean_nifty_5min['Volume_Sum'] = clean_nifty_5min.apply(calculate_volume_sum, axis=1)
clean_nifty_5min['High_Low_Difference'] = clean_nifty_5min.apply(calculate_high_low_difference, axis=1)

# Drop rows with missing values (unmatched minute opens or fewer than two prior
# trading days) and reset the index so positional fold indices match row labels
clean_nifty_5min = clean_nifty_5min.dropna().reset_index(drop=True)
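The volume features can also be computed column-wise instead of with `apply(..., axis=1)`. The following is a sketch on hypothetical toy volumes, not the notebook's code; `safe_v2` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical volumes; the zero exercises the division guard.
df = pd.DataFrame({
    "Volume_1": [150.0, 80.0, 50.0],
    "Volume_2": [100.0, 160.0, 0.0],
})

v1, v2 = df["Volume_1"], df["Volume_2"]
safe_v2 = v2.where(v2 != 0)  # zeros become NaN so the division never hits 0

df["Volume_Difference"] = v1 - v2
df["Volume_Sum"] = v1 + v2
# fillna(0) reproduces the "return 0 when Volume_2 is zero" rule.
df["Volume_Ratio"] = (v1 / safe_v2).fillna(0)
df["Volume_Percentage_Change"] = ((v1 - v2) / safe_v2 * 100).fillna(0)
print(df)
```

Note that `fillna(0)` would also zero out rows where a volume is genuinely missing, so this shortcut assumes the volume columns contain no NaN values at this point.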

# Model Implementations

In [None]:
seed = 42
fold_num = 5

kfold = KFold(n_splits=fold_num, random_state=seed, shuffle=True)
for fold_id, (_, val_index) in enumerate(kfold.split(clean_nifty_5min)):
    clean_nifty_5min.loc[val_index, "fold_id"] = fold_id
clean_nifty_5min["fold_id"] = clean_nifty_5min["fold_id"].astype(int)

def train_lgbm(
    train_data: pd.DataFrame,
    feature_columns: list,
    num_class: int,
    target_column: str,
    params: dict,
    num_boost_round: int = 1000
):

    models = []
    X_train = train_data[feature_columns].values
    y_train = train_data[target_column]

    valid_pred = np.zeros((y_train.shape[0], num_class))

    print("feature shape:", X_train.shape)
    callbacks = [
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=50)
    ]
    fold_num = train_data["fold_id"].max() + 1
    for fold_id in tqdm(range(fold_num)):
        train_index = train_data.query(f"fold_id!={fold_id}").index
        val_index = train_data.query(f"fold_id=={fold_id}").index
        x_tr, x_val = X_train[train_index], X_train[val_index]
        y_tr, y_val = y_train.values[train_index], y_train.values[val_index]
        # Feature names are attached to the Dataset; the validation set inherits
        # them through `reference`.
        lgb_train = lgb.Dataset(x_tr, y_tr, feature_name=feature_columns)
        lgb_eval = lgb.Dataset(x_val, y_val, reference=lgb_train)
        gbm = lgb.train(
            params,
            lgb_train,
            num_boost_round=num_boost_round,
            valid_sets=[lgb_train, lgb_eval],
            callbacks=callbacks
        )
        models.append(gbm)
        tmp_val_pred = gbm.predict(x_val)
        valid_pred[val_index] = tmp_val_pred

    return valid_pred, models

num_class = 3
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "metric": {"multi_logloss"},
    "num_class": num_class,
    "num_leaves": 63,
    "learning_rate": 0.5,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "force_col_wise": False,
    "verbose": -1
}

# LightGBM's multiclass objective expects integer labels in [0, num_class).
# The encoding used for the original run is not shown; the alphabetical
# mapping below (buy=0, sell=1, wait=2) is an assumption.
label_map = {'buy': 0, 'sell': 1, 'wait': 2}
clean_nifty_5min['target_id'] = clean_nifty_5min['target'].map(label_map)

# Excluded features based on the provided columns
excluded_features = ['Date','1','2','3','4','target','target_id','Day', 'Month','high', 'low', 'close','Year', 'Hour', 'Minute','fold_id']
features = [col for col in clean_nifty_5min.columns if col not in excluded_features]

target = "target_id"

# Train the model
lgbm_valid_pred, models = train_lgbm(clean_nifty_5min, features, num_class, target, params)

# Calculate the predicted classes from the out-of-fold predictions
valid_pred_classes = np.argmax(lgbm_valid_pred, axis=1)

# Calculate accuracy against the integer-encoded labels
accuracy = accuracy_score(clean_nifty_5min[target], valid_pred_classes)
print("Accuracy:", accuracy)

# Classification report
report = classification_report(clean_nifty_5min[target], valid_pred_classes)
print("Classification Report:\n", report)
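The overview lists multi-class log loss as an evaluation metric, while the cell above prints only accuracy and the classification report. As a rough sketch, the log loss of out-of-fold probabilities can be computed as below. The function `multiclass_log_loss` and the toy arrays are hypothetical; with the real data, the inputs would be the integer-encoded labels and `lgbm_valid_pred`.

```python
import numpy as np

def multiclass_log_loss(y_true: np.ndarray, proba: np.ndarray, eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1.0)
    return float(-np.mean(np.log(proba[np.arange(len(y_true)), y_true])))

# Hypothetical out-of-fold probabilities for four samples and three classes.
y_true = np.array([0, 1, 2, 1])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
])

loss = multiclass_log_loss(y_true, proba)
accuracy = float(np.mean(np.argmax(proba, axis=1) == y_true))
print(f"log loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Reporting both numbers is useful because log loss penalizes confident wrong predictions, which accuracy alone does not show.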



## Training Output

**Feature shape:** (171,536, 231)

### Training Progress

1. **Fold 1:**
   - **Iteration 50:** Training's multi_logloss: 0.190518, Validation's multi_logloss: 0.244118
   - **Iteration 100:** Training's multi_logloss: 0.0899947, Validation's multi_logloss: 0.165014
   - **Iteration 150:** Training's multi_logloss: 0.0482171, Validation's multi_logloss: 0.13816
   - **Iteration 200:** Training's multi_logloss: 0.027825, Validation's multi_logloss: 0.130052
   - **Iteration 239:** Training's multi_logloss: 0.0183686, Validation's multi_logloss: 0.128642 (Best Iteration)

2. **Fold 2:**
   - **Iteration 50:** Training's multi_logloss: 0.190836, Validation's multi_logloss: 0.242549
   - **Iteration 66:** Training's multi_logloss: 0.153848, Validation's multi_logloss: 0.212219 (Best Iteration)

3. **Fold 3:**
   - **Iteration 50:** Training's multi_logloss: 0.187178, Validation's multi_logloss: 0.242658
   - **Iteration 84:** Training's multi_logloss: 0.111893, Validation's multi_logloss: 0.183826 (Best Iteration)

4. **Fold 4:**
   - **Iteration 50:** Training's multi_logloss: 0.188349, Validation's multi_logloss: 0.23784
   - **Iteration 100:** Training's multi_logloss: 0.0888571, Validation's multi_logloss: 0.162722
   - **Iteration 150:** Training's multi_logloss: 0.0471114, Validation's multi_logloss: 0.137583
   - **Iteration 200:** Training's multi_logloss: 0.026916, Validation's multi_logloss: 0.131967
   - **Iteration 245:** Training's multi_logloss: 0.0165539, Validation's multi_logloss: 0.131296 (Best Iteration)

5. **Fold 5:**
   - **Iteration 50:** Training's multi_logloss: 0.192043, Validation's multi_logloss: 0.245321
   - **Iteration 100:** Training's multi_logloss: 0.0894215, Validation's multi_logloss: 0.165799
   - **Iteration 150:** Training's multi_logloss: 0.0476919, Validation's multi_logloss: 0.13982
   - **Iteration 200:** Training's multi_logloss: 0.0271293, Validation's multi_logloss: 0.131897
   - **Iteration 218:** Training's multi_logloss: 0.0224201, Validation's multi_logloss: 0.129734 (Best Iteration)

### Final Results

- **Accuracy:** 0.9513

**Classification Report:**

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.94      | 0.94   | 0.94     | 51,727  |
| 1     | 0.95      | 0.95   | 0.95     | 58,742  |
| 2     | 0.96      | 0.96   | 0.96     | 61,067  |

**Overall:**

- **Accuracy:** 0.95
- **Macro Avg Precision:** 0.95
- **Macro Avg Recall:** 0.95
- **Macro Avg F1-Score:** 0.95
- **Weighted Avg Precision:** 0.95
- **Weighted Avg Recall:** 0.95
- **Weighted Avg F1-Score:** 0.95


In [None]:
# Fold 1 reached the lowest validation log loss in the run above (0.128642), so its model is kept
best_model = models[0]

In [None]:
import joblib

# Save the model
joblib.dump(best_model, 'lgb_model_june11.pkl')

In [None]:
import matplotlib.pyplot as plt

def plot_aggregated_category_importances(models, categories, importance_type='gain', top_n=50):
    """Plot aggregated feature importances for specified categories.

    Args:
        models (list of lightgbm.Booster): List of trained LightGBM models.
        categories (list of str): List of categories to aggregate feature importances.
        importance_type (str, optional): Type of importance to plot. Defaults to 'gain'.
        top_n (int, optional): Number of top features to display. Defaults to 50.
    """
    # Initialize dictionaries to store importances for the specified categories and other features
    category_importances = {category: 0 for category in categories}
    other_importances = {}

    # Aggregate importances from each model
    for model in models:
        feature_importances = model.feature_importance(importance_type=importance_type)
        feature_names = model.feature_name()

        # Check each feature to determine which category it belongs to, if any, and accumulate importances
        for feature, importance in zip(feature_names, feature_importances):
            matched = False
            for category in categories:
                if category in feature:
                    category_importances[category] += importance
                    matched = True
                    break
            if not matched:  # If the feature does not match any category
                other_importances[feature] = other_importances.get(feature, 0) + importance

    # Combine all importances into one dictionary
    all_importances = {**category_importances, **other_importances}

    # Sort features by their importance
    sorted_features = sorted(all_importances.items(), key=lambda x: x[1], reverse=True)

    # Extract sorted feature names and their importances
    sorted_feature_names, sorted_feature_importances = zip(*sorted_features)

    # Limit to top_n features
    sorted_feature_names = sorted_feature_names[:top_n]
    sorted_feature_importances = sorted_feature_importances[:top_n]

    # Plot the results
    plt.figure(figsize=(12, 8))
    plt.barh(sorted_feature_names, sorted_feature_importances)
    plt.xlabel('Aggregated Feature Importance')
    plt.ylabel('Feature')
    plt.title(f'Top {top_n} Aggregated Feature Importance Plot')
    plt.gca().invert_yaxis()  # Invert y-axis to display the most important features on top
    plt.show()

# `models` is the list of per-fold LightGBM models returned by train_lgbm.
# An empty category list means every feature is reported individually.
categories = []

# Use the function to display feature importances
plot_aggregated_category_importances(models, categories)
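The `categories` argument groups features by substring, which the call above does not use because the list is empty. The sketch below illustrates the grouping logic on plain dictionaries, without LightGBM. The helper `aggregate_by_category` and the gain values are hypothetical.

```python
def aggregate_by_category(importances: dict, categories: list) -> dict:
    """Sum importances of features whose name contains a category substring."""
    totals = {category: 0.0 for category in categories}
    for feature, value in importances.items():
        for category in categories:
            if category in feature:
                totals[category] += value
                break
        else:
            # No category matched: keep the feature under its own name.
            totals[feature] = totals.get(feature, 0.0) + value
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical gain values keyed by feature name.
toy_importances = {
    "Fib_0.5_H1_L1": 30.0,
    "Fib_1.5_H2_L1": 50.0,
    "Volume_1": 40.0,
    "Volume_Sum": 25.0,
    "open": 10.0,
}
grouped = aggregate_by_category(toy_importances, ["Fib_", "Volume_"])
print(grouped)
```

Passing something like `["Fib_", "Volume_"]` to `plot_aggregated_category_importances` would collapse the many Fibonacci and volume columns into two bars, which can make the plot easier to read.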

## Top 5 Features by Aggregated Importance

1. **Fib_1.5_Low_1_High_2**
   - Aggregated Feature Importance: ~170,000

2. **Volume_1**
   - Aggregated Feature Importance: ~160,000

3. **Fib_1.618_Low_1_High_2**
   - Aggregated Feature Importance: ~140,000

4. **Volume_Sum**
   - Aggregated Feature Importance: ~130,000

5. **Volume_Difference**
   - Aggregated Feature Importance: ~120,000
