Stock market prediction is the attempt to determine the future value of a company's stock or another financial instrument traded on an exchange. Successfully predicting a stock's future price could yield significant profit. The efficient market hypothesis posits that stock prices are a function of information and rational expectations, and that newly revealed information about a company's prospects is almost immediately reflected in the current stock price. Predicting how the stock market will perform is one of the most difficult things to do. Many factors are involved: physical and psychological factors, rational and irrational behaviour, and so on. Together these aspects make share prices volatile and very difficult to predict with a high degree of accuracy.

In this project we worked with historical stock price data for a few publicly listed companies and implemented a machine learning model based on Long Short-Term Memory (LSTM) to predict future prices.

![image.png](attachment:image.png)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Exploratory Data Analysis(EDA)

When we’re getting started with a machine learning (ML) project, one critical principle to keep in mind is that data is everything. It is often said that if ML is the rocket engine, then the fuel is the (high-quality) data fed to ML algorithms. However, deriving truth and insight from a pile of data can be a complicated and error-prone job. To have a solid start for our ML project, it always helps to analyze the data up front.

During EDA, it’s important that we get a deep understanding of:

* The **properties of the data**, such as schema and statistical properties;
* The **quality of the data**, like missing values and inconsistent data types;
* The **predictive power of the data**, such as correlation of features against target.

This project didn't require extensive EDA, as the data is a simple time series. The only thing to verify in the Tesla dataset was whether any values were missing. Fortunately, none were.

Using the `info()` method of the pandas DataFrame, we found that all rows of the Open prices were filled.
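As a complement to `info()`, missing values can also be counted directly. The following is a minimal sketch on a hypothetical toy frame (the column values are made up for illustration and are not taken from the Tesla data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real price data.
toy = pd.DataFrame({
    "Open": [19.0, 25.8, np.nan, 23.0],
    "Close": [23.9, 23.8, 22.0, 19.2],
})

# Count missing entries per column; a count of 0 means the column is complete.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no column contains a missing entry.
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

On the real dataframe the same two calls would be applied to `dataset` instead of `toy`.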

In [None]:
dataset = pd.read_csv("/kaggle/input/tesla-stock-price/Tesla.csv - Tesla.csv.csv")

In [None]:
dataset.shape

In [None]:
dataset.head()

In [None]:
dataset.tail()

In [None]:
# check for any correlation between the numeric columns (the Date column is excluded)
plt.figure(figsize = (10,10))
sns.heatmap(dataset.select_dtypes("number").corr(), annot = True, fmt = ".1g", vmin = -1, vmax = 1, center = 0, linewidths = 3,
           linecolor = "black", square = True, cmap = "summer")

This heat map can be used to understand how the stock volume correlates with the prices (open, close, high, low) for future applications. In this project, however, we restrict ourselves to predicting the open stock price from historical data.

In [None]:
dataset.info()

There are no missing values; every column is fully populated.

In [None]:
plt.figure(figsize = (20, 12))
x = np.arange(0, dataset.shape[0], 1)
plt.subplot(2,1,1)
plt.plot(x, dataset.Open.values, color = "red", label = "Open Tesla Price")
plt.plot(x, dataset.Close.values, color = "blue", label = "Close Tesla Price")
plt.title("Tesla Stock Prices 2010-2017")
plt.xlabel("Days")
plt.ylabel("Stock Prices in US Dollar")
plt.legend(loc = "best")
plt.grid(which = "major", axis = "both")

plt.subplot(2,1,2)
plt.plot(x, dataset.Volume.values, color = "green", label = "Tesla Stock Volume")
plt.title("Tesla Stock Volume 2010-2017")
plt.xlabel("Days")
plt.ylabel("Volume")
plt.legend(loc = "best")
plt.grid(which = "major", axis = "both")
plt.show()

# Hyperparameters :

Our machine learning model was governed by two hyperparameters:

* `Time Step`: the number of past days the model looks at to predict the price on a given day. For illustration, if we set time_step = 7, then to predict the price on **day n** the model uses the prices from **day n-7** to **day n-1**. The idea is that the model conditions only on recently reported prices, rather than fitting the whole dataset at once as a traditional machine learning algorithm (such as polynomial regression) would.

* `Days`: the number of days at the end of the dataset for which prices are predicted. These are placed in our validation/test set.
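To make the two hyperparameters concrete, here is a small sketch of how they partition a series. The values `time_step = 3`, `days = 5` and the 30-element toy series are assumptions chosen only to keep the arithmetic easy to follow; they are not the values used in this notebook.

```python
import numpy as np

# Hypothetical toy values, not the notebook's settings.
time_step = 3
days = 5
prices = np.arange(30)  # stand-in for 30 daily prices, indexed 0..29

# Everything except the last `days` entries is used for training.
train = prices[: len(prices) - days]

# The validation slice starts `time_step` entries earlier, so that the first
# validation day still has a full look-back window.
val = prices[len(prices) - days - time_step :]

print(len(train), len(val))

# Look-back window and target for the first validation day.
first_window, first_target = val[:time_step], val[time_step]
print(first_window, first_target)
```

The validation slice therefore overlaps the end of the training data by `time_step` entries, but only as model input, never as a prediction target.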

In [None]:
TIME_STEP = 5
DAYS = 20 # number of days at the end for which we have to predict. These will be in our validation set.

In [None]:
dataset = pd.read_csv("/kaggle/input/tesla-stock-price/Tesla.csv - Tesla.csv.csv")

In [None]:
def dataset_split(dataset) : 
    train = dataset[0: len(dataset) - DAYS]
    val = dataset[len(dataset) - DAYS - TIME_STEP : len(dataset)]
    return train, val

In [None]:
dataset.drop(["Date","High", "Low", "Close", "Volume", "Adj Close"], axis = 1, inplace = True)
dataset = dataset.values

# Scaling :

It refers to putting the values in the same range or same scale so that no variable is dominated by the other.

Most of the time, a dataset contains features that vary widely in magnitude, units and range. Since many machine learning algorithms use the Euclidean distance between data points in their computations, this is a problem. If left alone, these algorithms take in only the magnitude of the features and neglect the units, so the results would differ greatly between units, for illustration 5 kg and 5000 g. Features with high magnitudes weigh far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.

In a nutshell, scaling helps our optimization algorithm converge faster on our data. **In the figure we can see that the skewness of the data distribution decreases considerably after scaling, as a result of which gradient descent (the optimization algorithm) converges faster.**

For scaling we import the scikit-learn machine learning library, where **we use MinMaxScaler to scale all the price values between 0 and 1, the feature range provided in the code.**
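For a feature range of (0, 1), min-max scaling maps each value x to (x - min) / (max - min), computed per column. The sketch below reproduces that arithmetic with plain NumPy on a hypothetical four-value price column, together with the inverse mapping that is later needed to turn predictions back into dollars:

```python
import numpy as np

# Hypothetical opening prices, shaped as a single column like the array
# that is passed to the scaler.
prices = np.array([[20.0], [25.0], [30.0], [40.0]])

col_min = prices.min(axis=0)
col_max = prices.max(axis=0)

# Min-max scaling to the range [0, 1].
scaled = (prices - col_min) / (col_max - col_min)

# Inverse mapping back to the original units.
restored = scaled * (col_max - col_min) + col_min

print(scaled.ravel())
print(restored.ravel())
```

The smallest price maps to 0, the largest to 1, and the inverse mapping recovers the original values.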

In [None]:
import sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0,1))
dataset_scaled = scaler.fit_transform(dataset)

In [None]:
train, val = dataset_split(dataset_scaled)

In [None]:
train.shape, val.shape

# Configuring The Dataset For Deep Learning :

Since we planned to use an LSTM model for time-series prediction, the dataset had to be converted from a 1-D series into a 3-D tensor of shape (samples, time_step, 1). For this we grouped the values from the past **time_step** days into one window and stacked such windows one behind the other.
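The windowing can be sketched on a toy series first. Here `time_step = 3` and the ten-value series are hypothetical, picked so the resulting shapes are easy to check by hand:

```python
import numpy as np

time_step = 3  # hypothetical look-back length
series = np.arange(10, dtype=float).reshape(-1, 1)  # toy (10, 1) column of scaled prices

windows, targets = [], []
for i in range(time_step, series.shape[0]):
    windows.append(series[i - time_step : i, 0])  # the previous `time_step` values
    targets.append(series[i, 0])                  # the value to predict

x = np.array(windows)                     # 2-D: (samples, time_step)
y = np.array(targets)                     # 1-D: (samples,)
x = x.reshape(x.shape[0], x.shape[1], 1)  # 3-D: (samples, time_step, features)

print(x.shape, y.shape)
print(x[0, :, 0], y[0])
```

Each sample holds `time_step` consecutive values and its target is the value that immediately follows, so a series of length N yields N - time_step samples.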

In [None]:
train_x, train_y = [], []
for i in range(TIME_STEP, train.shape[0]) : 
    train_x.append(train[i - TIME_STEP : i, 0])
    train_y.append(train[i, 0])
train_x, train_y = np.array(train_x), np.array(train_y)

In [None]:
val_x, val_y = [], []
for i in range(TIME_STEP, val.shape[0]) : 
    val_x.append(val[i - TIME_STEP : i, 0])
    val_y.append(val[i, 0])
val_x, val_y = np.array(val_x), np.array(val_y)

In [None]:
train_x = np.reshape(train_x, (train_x.shape[0], train_x.shape[1], 1))
val_x = np.reshape(val_x, (val_x.shape[0], val_x.shape[1], 1))
print("Reshaped train_x = ", train_x.shape)
print("Shape of train_y = ", train_y.shape)

print("Reshaped val_x = ", val_x.shape)
print("Shape of val_y = ", val_y.shape)

# Long Short Term Memory - LSTM :

Humans don’t start their thinking from scratch every second. As we read this paragraph, we understand each word based on our understanding of previous words. We don’t throw everything away and start thinking from scratch again. Our thoughts have persistence. Traditional neural networks can’t do this, and it seems like a major shortcoming.

Recurrent Neural Networks(RNNs) address this issue. They are networks with loops in them, allowing information to persist.

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

How an RNN works is beautifully illustrated at: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
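To give a feel for how information persists from one time step to the next, here is a minimal NumPy sketch of a single LSTM step following the standard gate equations. The function name `lstm_step`, the gate ordering, the weight layout and the random weights are all assumptions made for illustration; this is not how Keras stores or computes its LSTM layer internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout).

    W has shape (4 * hidden, hidden + inputs) and b has shape (4 * hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden : 1 * hidden])  # forget gate: how much old memory to keep
    i = sigmoid(z[1 * hidden : 2 * hidden])  # input gate: how much new information to write
    o = sigmoid(z[2 * hidden : 3 * hidden])  # output gate: how much memory to expose
    g = np.tanh(z[3 * hidden : 4 * hidden])  # candidate cell values
    c_t = f * c_prev + i * g                 # updated cell state (long-term memory)
    h_t = o * np.tanh(c_t)                   # updated hidden state (output)
    return h_t, c_t

# Hypothetical sizes and untrained random weights.
rng = np.random.default_rng(0)
hidden, inputs = 4, 1
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)

# Feed a short sequence of scaled prices, carrying h and c forward each step.
h, c = np.zeros(hidden), np.zeros(hidden)
for price in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([price]), h, c, W, b)

print(h.shape, c.shape)
```

The cell state `c` is what lets the network carry information across many steps: the forget gate decides what is retained and the input gate decides what is added.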

Using the Keras API of TensorFlow, a model was built from LSTM layers stacked on top of each other, followed by a small fully connected Artificial Neural Network (ANN).

* The ReLU activation function was used in the dense layers, with a dropout of 0.4 after each LSTM layer.
* The Adam optimizer and the MSE (Mean Squared Error) loss function were used.

In [None]:
import tensorflow as tf

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
gpus = tf.config.list_physical_devices("GPU")
print(gpus)
if len(gpus) == 1 : 
    strategy = tf.distribute.OneDeviceStrategy(device = "/gpu:0")
else:
    strategy = tf.distribute.MirroredStrategy()

In [None]:
tf.config.optimizer.set_experimental_options({"auto_mixed_precision" : True})
print("Mixed precision enabled")

In [None]:
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor= "loss", factor = 0.5, patience = 10,
                                                 min_lr = 0.000001, verbose = 1)
monitor_es = tf.keras.callbacks.EarlyStopping(monitor= "loss", patience = 25, restore_best_weights= False, verbose = True)

In [None]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(units = 128, return_sequences = True, input_shape = (train_x.shape[1], 1)))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.LSTM(units = 128, return_sequences = True))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.LSTM(units = 128, return_sequences = True))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.LSTM(units = 128, return_sequences = False))
model.add(tf.keras.layers.Dropout(0.4))

model.add(tf.keras.layers.Dense(units = 20, activation = "relu"))
model.add(tf.keras.layers.Dense(units = 1, activation = "relu"))

In [None]:
model.compile(tf.keras.optimizers.Adam(learning_rate = 0.001), loss = "mean_squared_error")

In [None]:
model.summary()

In [None]:
with tf.device("/device:GPU:0"):
    history = model.fit(train_x, train_y, epochs = 300, batch_size = 16, callbacks = [reduce_lr, monitor_es])

In [None]:
plt.figure(figsize = (12, 4))
plt.plot(history.history["loss"], label = "Training loss")
plt.title("Loss analysis")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.grid(which = "major", axis = "both")

In [None]:
model_json = model.to_json()
with open("tesla_open_1.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("tesla_open_1.h5")

In [None]:
# Rebuild the architecture from JSON and restore the trained weights.
with open("tesla_open_1.json", "r") as json_file:
    loaded_model_json = json_file.read()
loaded_model = tf.keras.models.model_from_json(loaded_model_json)
loaded_model.load_weights("tesla_open_1.h5")
print("Loaded model from disk")
loaded_model.compile(loss = "mean_squared_error", optimizer = "adam")

In [None]:
real_prices = val[TIME_STEP:]
real_prices = scaler.inverse_transform(real_prices)

In [None]:
predicted_prices = loaded_model.predict(val_x)
predicted_prices = scaler.inverse_transform(predicted_prices)

In [None]:
plt.figure(figsize= (16, 5))
plt.subplot(1,1,1)

x = np.arange(0, DAYS, 1)

plt.plot(x, real_prices, color = "red", label = "Real Tesla Prices")
plt.plot(x, predicted_prices, color = "blue", label = "Predicted Tesla Prices")
plt.title("Tesla Open Stock Prices", fontsize = 18)
plt.xlabel("Time In Days", fontsize = 18)
plt.ylabel("Stock Prices in US Dollars", fontsize = 18)
plt.legend()
plt.grid(which = "major", axis = "both")

In [None]:
original_training_prices = scaler.inverse_transform(train)
original_training_prices

In [None]:
x1 = np.arange(0,len(original_training_prices),1)
x2 = np.arange(len(original_training_prices), len(dataset), 1)
print(len(x1), len(x2))

In [None]:
plt.figure(figsize= (16,8))
plt.subplot(1,1,1)

x1 = np.arange(0,len(original_training_prices),1)
x2 = np.arange(len(original_training_prices), len(dataset), 1)

plt.plot(x1, original_training_prices, color = "green")
plt.plot(x2, real_prices, color = "red", label = "Real Tesla Prices")
plt.plot(x2, predicted_prices, color = "blue", label = "Predicted Tesla Prices")
plt.title("Tesla Open Stock Prices", fontsize = 18)
plt.xlabel("Time In Days", fontsize = 18)
plt.ylabel("Stock Prices in US Dollars", fontsize = 18)
plt.legend()
plt.grid(which = "major", axis = "both")