Mount Google Drive to read the sp500.csv file.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Command to verify whether the file exists.

In [None]:
!ls "/content/drive/My Drive/Major Project Sem 8/Mid Sem Eval/sp500.csv"

ls: cannot access '/content/drive/My Drive/Major Project Sem 8/Mid Sem Eval/sp500.csv': No such file or directory


# INTRODUCTION



Predicting the stock market using machine learning.
The data comes from the S&P 500 index, a stock market index that tracks the stock performance of 500 of the largest companies listed on stock exchanges in the United States.

Overview of the work done in this notebook:
1. Clean the data.
2. Train the model.
3. Backtest the model (to find how well it works).
4. Add more predictors to improve accuracy.
5. List pointers for improving the model further.

In the end we'll predict whether tomorrow's price will go up or down on the basis of previous data.

A stock, also known as a share or equity, represents ownership in a specific company. When you buy a stock, you essentially become a shareholder in that company, which means you own a portion of the company's assets.  
An index is a statistical measure of the performance of a basket of stocks or securities representing a particular market or sector. It is used to track and benchmark the overall performance of the market or a specific segment of the market.  
Investors can buy and sell stocks, but they cannot directly buy or sell an index.

# Downloading The Data


In [None]:
import yfinance as yf # yahoo finance api to download daily stock prices
import pandas as pd
import os

In [None]:
file_path = '/content/drive/My Drive/Major Project Sem 8/Mid Sem Eval/sp500.csv'
if os.path.exists(file_path):  # load the saved copy only if the file is actually there
    sp500 = pd.read_csv(file_path, index_col=0)
else:
    sp500 = yf.Ticker("^GSPC")  # ^GSPC is the Yahoo Finance symbol for the S&P 500 index
    sp500 = sp500.history(period="max")  # query all available history; returns a pandas DataFrame
    sp500.to_csv("sp500.csv")

In [None]:
sp500.index = pd.to_datetime(sp500.index)

## Explaining the Dataset

Date (index): trading day  
Open: opening price of the day  
High: highest price of the day  
Low: lowest price of the day  
Close: closing price of the day  
Volume: number of shares traded during the day


In [None]:
sp500

In [None]:
sp500.plot.line(y="Close", use_index=True) # show the closing price over the years

# Cleaning The Data

The Dividends and Stock Splits columns apply to individual stocks, not to an index, so they are irrelevant here and are removed.

In [None]:
del sp500["Dividends"]
del sp500["Stock Splits"]

Setting up the target for prediction.  
For now we are predicting whether the price will go up or down tomorrow.  
The other option is to predict the absolute price. The problem with that approach is that a model can predict the absolute price very accurately and still lose money, because it carries no direct knowledge of whether the move is positive or negative.  

So, to simplify the problem, the first step is to figure out whether the price will go up or down.  


In [None]:
sp500["Tomorrows_Close"] = sp500["Close"].shift(-1) # to show next day's closing price on a particular day

## Setting Up The Target
Add a flag that is 1 when tomorrow's closing price is greater than today's closing price and 0 otherwise.


In [None]:
sp500["Target"] = (sp500["Tomorrows_Close"] > sp500["Close"]).astype(int) # typecasting into int
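
A minimal sketch of how `shift(-1)` and the comparison build the target, using made-up closing prices rather than real S&P 500 data (the `toy` frame is hypothetical). Note that the last row has no next day, so its `Tomorrows_Close` is NaN and the comparison yields 0 there; that 0 is a missing label, not a real "down" day.

```python
import pandas as pd

# Made-up closing prices for illustration only (not S&P 500 data)
toy = pd.DataFrame(
    {"Close": [10.0, 12.0, 11.0, 13.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Pull the next row's close onto the current row
toy["Tomorrows_Close"] = toy["Close"].shift(-1)

# 1 if tomorrow closes higher than today, else 0 (NaN compares as False)
toy["Target"] = (toy["Tomorrows_Close"] > toy["Close"]).astype(int)

print(toy)
```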

> Note: It is good to have data that goes far back in history, but going too far back has its downsides, because markets change over a few decades and there can be incidents like stock market crashes. For now we are keeping these things out of scope, and hence removing the data from before 1990.

In [None]:
sp500 = sp500.loc["1990-01-01":].copy() # take the indexes in range [1990, present]

In [None]:
sp500

#  Training the Machine Learning Model

A Random Forest Classifier is used because:
1. It works by training a bunch of individual decision trees with randomized parameters and then averaging the results from those trees.
1. It is therefore resistant to overfitting, and it trains quickly because the trees can be trained in parallel.
1. It can pick up non-linear relationships in the data.

Other possible options are gradient boosted trees and SVM (support vector machine).
Gradient boosted trees can be more accurate than random forests. Because the trees are trained to correct each other's errors, they are capable of capturing complex patterns in the data. However, if the data is noisy, the boosted trees may overfit and start modeling the noise.
SVM is a powerful algorithm for both classification and regression tasks. It works well in high-dimensional spaces and is effective in cases where the decision boundary is not necessarily linear.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# n_estimators = number of decision trees (try increasing); min_samples_split = protection against overfitting;
# random_state = fixed seed, so rerunning the model gives the same results
model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)



## Splitting the data into Train and test set

Done to prevent leakage of data into the model. For example, evaluating the model on data it has already been trained on can give good results on the test data, but the results on new data will be very bad. The model should not have any information about the future when predicting the future.

In [None]:
train = sp500.iloc[:-100] # all rows except the last 100 into train set
test = sp500.iloc[-100:] # last 100 into test set
# Tomorrows_Close and Target are not used as predictors,
# since they won't be available at real-time prediction
predictors = ["Close", "Volume", "Open", "High", "Low"]
# use the predictors to predict the Target
model.fit(train[predictors], train["Target"])
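
As a small illustration of why the split is chronological, the sketch below contrasts it with an interleaved split on a made-up ten-day frame (`toy` and the interleaved split are assumptions for illustration, not part of the notebook's pipeline). In the chronological split every training date precedes every test date; in the interleaved one the training set contains dates later than some test dates, which is the leakage described above.

```python
import pandas as pd

# Ten made-up daily rows for illustration only
toy = pd.DataFrame(
    {"Close": range(10)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

# Chronological split: the last 3 rows are held out, as in the notebook
train, test = toy.iloc[:-3], toy.iloc[-3:]
chronological_ok = train.index.max() < test.index.min()

# Interleaved split for contrast: every third row goes to the test set
bad_test = toy.iloc[::3]
bad_train = toy.drop(bad_test.index)
leaks_future = bad_train.index.max() > bad_test.index.min()

print("chronological split keeps the future out of training:", chronological_ok)
print("interleaved split trains on dates after test dates:", leaks_future)
```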

## Checking the Accuracy of the model


In [None]:
from sklearn.metrics import precision_score

preds = model.predict(test[predictors])
preds = pd.Series(preds, index=test.index)
# print(preds)
precision_score(test["Target"], preds)

> Not a good precision score.
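
For reference, precision is the share of predicted "up" days that really were up days: TP / (TP + FP). The sketch below computes it by hand on made-up labels and compares the result with `precision_score`; the `y_true` and `y_pred` lists are hypothetical.

```python
from sklearn.metrics import precision_score

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

# True positives: predicted up and actually up
tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
# False positives: predicted up but actually not up
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)

manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_pred)

print(manual_precision, sklearn_precision)
```

Precision fits this problem because a trade is only placed when the model predicts 1, so what matters is how often those predictions are right.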

### Visualising the predictions
Comparing the Target in the test set with the preds made by the model on the test set.


In [None]:
combined = pd.concat([test["Target"], preds], axis=1)
combined.plot()

In [None]:
print(combined)

# Backtesting the algorithm

Backtesting evaluates the model over many years of data instead of only the last 100 days, so that it is tested on more cases and we can be more confident in its predictions.

In [None]:
# combining all the steps above into a single function
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)
    return combined

In [None]:
# 1 trading year has about 250 days, so the first model is trained on 10 years (2500 rows) of data
# each following model is trained on one more year (250 rows) than the previous one
# train 10 yrs -> predict 11th year
# train 11 yrs -> predict 12th yr
# train 12 yrs -> predict 13th yr
def backtest(data, model, predictors, start=2500, step=250):
    all_predictions = []

    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i+step)].copy()
        predictions = predict(train, test, predictors, model)
        all_predictions.append(predictions)

    return pd.concat(all_predictions)
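
To make the expanding window concrete, here is a small sketch that lists the row ranges the loop above would use, without training anything. `backtest_windows` is a hypothetical helper written for this illustration, and the row count of 3100 is made up.

```python
def backtest_windows(n_rows, start=2500, step=250):
    """Return (train_rows, test_start, test_end) for each backtest iteration.

    Mirrors the slicing in backtest(): train = rows [0, i), test = rows [i, i + step).
    """
    windows = []
    for i in range(start, n_rows, step):
        # iloc slicing stops at the end of the data, so the last window may be shorter
        test_end = min(i + step, n_rows)
        windows.append((i, i, test_end))
    return windows


windows = backtest_windows(3100)
for train_rows, test_start, test_end in windows:
    print(f"train on rows 0-{train_rows - 1}, test on rows {test_start}-{test_end - 1}")
```

The training set grows by one step each iteration, and every test block lies strictly after the rows used for training.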

In [None]:
predictions = backtest(sp500, model, predictors)

In [None]:
predictions["Predictions"].value_counts()

In [None]:
# comparing the target to the predictions
precision_score(predictions["Target"], predictions["Predictions"])

In [None]:
new_combined = pd.concat([predictions["Target"], predictions["Predictions"]], axis=1)
new_combined.plot()

Right now the algorithm is not performing well.

In [None]:
print(new_combined)

In [None]:
# checking if the market went up or down
# predictions["Target"].value_counts() / predictions.shape[0]

## Adding more predictors to the Model

Horizons (in trading days) to look at while predicting.

For each horizon, the code creates two new columns in the sp500 DataFrame. ratio_column is today's close divided by the average close over the last horizon days (including today). trend_column is a rolling sum of the "Target" column over a window of horizon days, i.e. the number of days in that window on which the price went up. The .shift(1) lags the "Target" values by one period before calculating the rolling sum, so that the current day's Target, which depends on tomorrow's close, is not included.

In [None]:
horizons = [2,5,60,250,1000]
new_predictors = []

for horizon in horizons:
    rolling_averages = sp500.rolling(horizon).mean()

    ratio_column = f"Close_Ratio_{horizon}"
    sp500[ratio_column] = sp500["Close"] / rolling_averages["Close"]

    trend_column = f"Trend_{horizon}"
    sp500[trend_column] = sp500.shift(1).rolling(horizon).sum()["Target"]

    new_predictors+= [ratio_column, trend_column]
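
The two features can be checked by hand on a tiny example. The sketch below uses made-up prices and a single horizon of 2 (the `toy` frame and its values are assumptions for illustration); it computes the same ratio and trend columns on a single column instead of the whole frame.

```python
import pandas as pd

# Made-up closing prices for illustration only
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 13.0, 14.0]})
toy["Target"] = (toy["Close"].shift(-1) > toy["Close"]).astype(int)

horizon = 2

# Today's close relative to the mean close of the last `horizon` days
toy["Close_Ratio_2"] = toy["Close"] / toy["Close"].rolling(horizon).mean()

# Number of up days among the previous `horizon` days; shift(1) drops today's
# Target, which would otherwise leak tomorrow's close into the feature
toy["Trend_2"] = toy["Target"].shift(1).rolling(horizon).sum()

print(toy)
```

The first rows are NaN because there is not enough history to fill the window, which is why the notebook drops rows with missing values afterwards.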

In [None]:
sp500 = sp500.dropna(subset=sp500.columns[sp500.columns != "Tomorrows_Close"])

In [None]:
sp500

Training the model again, increasing the number of estimators and reducing the minimum sample split.

In [None]:
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

In [None]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    preds = model.predict_proba(test[predictors])[:,1] # probability that the price goes up (class 1)
    # by default the decision threshold is 0.5, but to improve precision and make the model stricter,
    # the threshold is moved to 0.6 instead
    preds[preds >=.6] = 1
    preds[preds <.6] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)
    return combined
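
The effect of moving the threshold can be seen on a handful of made-up probabilities (the `probs` array below is hypothetical, standing in for the output of `predict_proba(...)[:, 1]`). A stricter threshold produces fewer "up" predictions.

```python
import numpy as np

# Made-up class-1 probabilities for illustration only
probs = np.array([0.45, 0.55, 0.62, 0.58, 0.71])

# Default behaviour of predict(): up if the probability is at least 0.5
preds_default = (probs >= 0.5).astype(int)

# Stricter rule used in the notebook: up only if the probability is at least 0.6
preds_strict = (probs >= 0.6).astype(int)

print("threshold 0.5:", preds_default, "->", preds_default.sum(), "up predictions")
print("threshold 0.6:", preds_strict, "->", preds_strict.sum(), "up predictions")
```

The vectorised comparison gives the same labels as the two in-place assignments in `predict`, just in one step.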

Instead of using Open, Close, etc., use the new predictors to get a better idea of the price movement.

In [None]:
predictions = backtest(sp500, model, new_predictors)

The number of 1.0 predictions decreases because of the change in threshold.

In [None]:
predictions["Predictions"].value_counts()

In [None]:
precision_score(predictions["Target"], predictions["Predictions"])

With just the time series data and only the stock prices, the result is good for baseline work.

In [None]:
predictions["Target"].value_counts() / predictions.shape[0]
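
The share of up days is a useful baseline for the precision score: a model that predicted "up" every single day would have a precision equal to that share. A minimal sketch on made-up labels (the `target` series is hypothetical):

```python
import pandas as pd

# Made-up targets for illustration only: 5 up days out of 8
target = pd.Series([1, 1, 0, 1, 0, 1, 1, 0])

# Fraction of days on which the price went up
base_rate = target.mean()

# A naive model that always predicts "up"
always_up = pd.Series(1, index=target.index)
true_positives = ((always_up == 1) & (target == 1)).sum()
precision_always_up = true_positives / (always_up == 1).sum()

print(base_rate, precision_always_up)
```

So the model's precision is only informative when it is compared with the up-day share printed above.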

In [None]:
predictions

# Possible Next Steps

1. Some exchanges are open overnight, so they trade a few hours before the S&P 500 opens; that data could be used to improve the predictions.
1. Maybe integrate news.
1. Add specific sectors and predict on that basis, e.g. how tech is performing, how oil is performing, etc.
1. Maybe increase the resolution of the data, e.g. to an hourly basis.