# Zomato Stock Price Prediction Using Time-Series Analysis


## GRU Model

A GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN) that processes sequential input, such as time series or natural-language text. GRUs, like LSTMs, are designed to mitigate the vanishing gradient problem that can arise when training standard RNNs. However, GRUs are simpler and have fewer parameters than LSTMs, which generally makes them quicker to train and can make them less prone to overfitting on small datasets.

The GRU layer is imported from the tensorflow.keras.layers module, which is part of the TensorFlow deep learning framework. The layer is used to build a GRU model that can be trained on sequential data, such as stock prices. It accepts an input shape that specifies the number of time steps and features in the input data. By default it returns the final hidden state (or the full sequence of hidden states when return_sequences=True), which can then be fed into a dense layer to make predictions.
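
To make the gating idea concrete, here is a minimal NumPy sketch of a single GRU time step. It is an illustration only: the helper gru_step, the weight names (Wz, Uz, and so on), and the random initialisation are assumptions made for this example, and the state update follows one common convention (references differ on whether the update gate weights the old or the new state).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step. `p` is a dict of hypothetical weight arrays."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate decides how much of the past to use
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    # Blend the previous state and the candidate state using the update gate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_features, n_units = 1, 4
params = {}
for gate in "zrh":
    params["W" + gate] = rng.normal(scale=0.1, size=(n_units, n_features))
    params["U" + gate] = rng.normal(scale=0.1, size=(n_units, n_units))
    params["b" + gate] = np.zeros(n_units)

# Run a short toy sequence (scaled prices) through the cell, one step at a time
sequence = np.array([[0.10], [0.20], [0.15]])
h = np.zeros(n_units)
for x_t in sequence:
    h = gru_step(x_t, h, params)
```

The final h plays the role of the hidden state that a dense layer would turn into a prediction. A trained Keras layer learns these weights rather than drawing them at random.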

## ARCH Model

An ARCH (Autoregressive Conditional Heteroskedasticity) model is a statistical method for modelling and forecasting time series whose volatility varies over time. Robert F. Engle introduced the ARCH model in 1982. It expresses the conditional variance of a time series as a function of previous squared errors. ARCH is a special case of the GARCH (Generalised Autoregressive Conditional Heteroskedasticity) model, a broader family in which the variance also depends on previous variances, which lets it capture volatility that persists over longer periods.

The model is typically estimated by maximum likelihood estimation (MLE). ARCH models are used for time series with time-varying volatility, such as stock prices, exchange rates, and interest rates. In Python, the model can be specified with the arch_model function in the arch package, fitted on historical data, and then used to forecast future volatility.
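
As a rough illustration of time-varying volatility, the sketch below simulates an ARCH(1) series with NumPy, where the conditional variance at time t is omega + alpha * eps[t-1]**2. The function name simulate_arch1 and the parameter values are assumptions chosen for this example, not estimates from the Zomato data.

```python
import numpy as np

def simulate_arch1(omega, alpha, n, seed=0):
    """Simulate an ARCH(1) series: sigma2[t] = omega + alpha * eps[t-1]**2."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(n)
    sigma2 = np.zeros(n)
    sigma2[0] = omega / (1.0 - alpha)  # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        # A large shock yesterday raises today's variance
        sigma2[t] = omega + alpha * eps[t - 1] ** 2
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

# Hypothetical parameters, for illustration only
returns, cond_var = simulate_arch1(omega=0.2, alpha=0.5, n=1000)
```

With the arch package, a comparable model would typically be specified as arch_model(returns, vol="ARCH", p=1) and estimated with its fit() method.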



## GRU vs ARCH 

GRU (Gated Recurrent Unit) models and ARCH (Autoregressive Conditional Heteroskedasticity) models are both used for time series forecasting, although they operate on distinct principles and with different types of data.

GRU models are recurrent neural networks (RNNs) that analyse sequential data, such as time series or natural-language text. In this project, a GRU model forecasts future stock prices from previous prices.

ARCH models, on the other hand, are statistical models for time series whose volatility varies over time. An ARCH model expresses the variance of a time series as a function of previous squared errors, so it forecasts volatility rather than the price itself.

When it comes to forecasting stocks, GRU models offer some benefits over ARCH models:

- GRU models are more flexible and can capture more complicated, non-linear patterns in data than ARCH models.
- GRU models can be adapted to missing or unevenly spaced data (for example, by masking or by adding time gaps as a feature), whereas ARCH models assume regularly spaced observations.
- GRU models can be trained to predict several steps ahead directly, whereas ARCH forecasts are most often used one step ahead.
- GRU models handle both univariate and multivariate time series, whereas ARCH models are mainly used with univariate data.


In conclusion, GRU models are more flexible and can capture more complicated patterns in data than ARCH models. As a result, GRU models are generally the better fit for predicting stock prices, while ARCH models remain useful for modelling volatility.


## Data Cleaning and Preparation

Data gathering, cleaning, and processing means systematically collecting information from various data sources, identifying errors and inconsistencies along the way, and correcting them so that the resulting data is easy to use.

In [2]:
# The first step is to import all the necessary libraries.
# pandas and numpy handle the data; matplotlib and plotly handle the plots.

import pandas as pd
#from pandas import NaT
import numpy as np

import datetime
import matplotlib.pyplot as plt
#import seaborn as sns
import plotly.express as px

## Displaying the data at its raw state

In [3]:
# Load the CSV file
df = pd.read_csv(r"C:\Users\sibon\Downloads\zomato.csv")

In [5]:
df

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2021-07-23,116.000000,138.899994,115.000000,126.000000,126.000000,694895290
1,2021-07-26,126.349998,143.750000,125.300003,140.649994,140.649994,249723854
2,2021-07-27,141.699997,147.800003,127.750000,132.899994,132.899994,240341900
3,2021-07-28,131.000000,135.000000,123.550003,131.199997,131.199997,159793731
4,2021-07-29,134.949997,144.000000,132.199997,141.550003,141.550003,117973089
...,...,...,...,...,...,...,...
626,2024-02-01,141.000000,143.500000,138.550003,140.550003,140.550003,70252449
627,2024-02-02,141.800003,145.000000,141.449997,143.800003,143.800003,78666454
628,2024-02-05,145.000000,145.399994,138.250000,140.250000,140.250000,54189688
629,2024-02-06,140.399994,141.800003,138.050003,139.949997,139.949997,46782951


## Checking for missing values.

In [None]:
df.isnull().sum()
#no null values found in the data.

In [None]:
#summary of the data set.
df.describe()

In [None]:
#Using skimpy to get a more detailed description of the data set.

import skimpy as sk
sk.skim(df)

## Converting date strings

This code snippet defines lists of day and month names and a function that splits a date string into its components. Working with structured date components, rather than raw strings, makes it easier to group, filter, and plot the data by year, month, or weekday.

In [None]:
days = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
months = ["January","February","March","April","May","June","July","August","September","October","November","December"]
def convert_dates(x):
    date = datetime.datetime.strptime(x, "%Y-%m-%d")
    return [date.year, date.month, date.day, date.isoweekday()]

This code transforms date strings into structured components (year, month, day, and weekday) and adds them as new columns to the DataFrame. These new columns can be useful for further analysis or visualization. 

In [None]:
df["Year"] = df["Date"].apply(lambda x: convert_dates(x)[0])
df["Month"] = df["Date"].apply(lambda x: months[convert_dates(x)[1]-1])
df["Day"] = df["Date"].apply(lambda x: convert_dates(x)[2])
df["Weekday"] = df["Date"].apply(lambda x: days[convert_dates(x)[3]-1])
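
The same four columns can also be derived without a custom function. The sketch below is one possible alternative that uses pandas' datetime accessor on a small stand-in frame; the name sample and its three dates are chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for the Zomato data (three dates from the raw table)
sample = pd.DataFrame({"Date": ["2021-07-23", "2021-07-26", "2024-02-06"]})

# Parse the strings once, then read the components from the .dt accessor
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d")
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month_name()   # full month name, e.g. "July"
sample["Day"] = parsed.dt.day
sample["Weekday"] = parsed.dt.day_name()   # full weekday name, e.g. "Friday"
```

Parsing each date once tends to be faster on larger frames than calling a conversion function several times per row.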

This loop creates a line plot for each column from the second to the seventh.

- The loop header, for i in df.columns[1:7], iterates over the column names of the DataFrame df from the second column (df.columns[1]) up to, but not including, the eighth column (df.columns[7]).
- Inside the loop, fig = px.line(df, x="Date", y=i, color="Year") creates a line plot with Plotly Express (px). The x parameter sets the x-axis to the "Date" column, the y parameter sets the y-axis to the column currently being iterated over (i), and the color parameter colours the lines by the "Year" column.

In [None]:
for i in df.columns[1:7]:
    fig = px.line(df, x="Date", y=i, color="Year")
    fig.show()

## Dealing with the outliers 

In [None]:
# Identify the range of values in each column using deciles.
# The result is displayed without overwriting df.
df.describe(percentiles=[x / 10 for x in range(1, 10)])

In [None]:
# Display a boxplot showing the distribution of the Open column, including outliers.
import seaborn as sns
sns.boxplot(x=df['Open'])
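
A boxplot shows outliers visually. To list them, one common option is the 1.5 × IQR rule, which is also seaborn's default for drawing whiskers. The sketch below is an assumption about how this step could be done, not the notebook's own method. The helper iqr_outliers and the toy prices are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy opening prices with one obviously extreme value
open_prices = pd.Series([116.0, 126.35, 141.7, 131.0, 134.95, 400.0])
outliers = iqr_outliers(open_prices)
```

For price data, flagged values are not necessarily errors. A listing-day spike, for example, may be a genuine observation, so it is usually worth inspecting them before removing anything.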

## Code Breakdown

### Data Preparation (prepare_data function)

In [None]:
# Function to prep csv file
def prepare_data():
    # Read stock price data
    url = 'https://raw.githubusercontent.com/Tiaan-Botes/CYO_Project_Group_E/52663ba4e6833e232f1bbd7a0ab48edb23f52b91/data/data.csv'
    data = pd.read_csv(url)

    data['Date'] = pd.to_datetime(data['Date'])
    data.sort_values('Date', inplace=True)
    data.set_index('Date', inplace=True)
    data.drop(columns=['Unnamed: 0', 'Year', 'Month', 'Day', 'Weekday'], inplace=True)

    data['ma_30'] = data['Adj Close'].rolling(window=30).mean()
    data['ma_90'] = data['Adj Close'].rolling(window=90).mean()
    
    data['daily_returns'] = data['Adj Close'].pct_change()*100
    data.dropna(inplace=True)

    return data

The function begins by reading stock price data from a CSV file hosted on GitHub. Judging by the columns dropped below, the file holds the historical prices together with the Year, Month, Day, and Weekday columns created earlier.

It then processes the data.
- Converts the 'Date' column to datetime format and uses it as an index. This step makes the date the main identifier for each data entry.
- Sorts the data by date to maintain chronological order.
- Removes unnecessary columns like 'Unnamed: 0', 'Year', 'Month', 'Day', and 'Weekday'. These columns can be superfluous or irrelevant to the study.
- Calculates two moving averages (MA) of the 'Adj Close' column over windows of 30 and 90 rows (trading days). Moving averages are often used to smooth out short-term fluctuations and show longer-term trends in the data.
- Calculates daily returns as the percentage change in the 'Adj Close' column from one trading day to the next. This measure gives insight into the stock's daily volatility and performance.
- Removes rows with missing values. Because the 90-day moving average is undefined for the first 89 rows, those rows are dropped.
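
The feature steps above can be traced on a tiny example. The sketch below uses a made-up five-value price series and a 3-row window instead of 30 or 90 so that the numbers are easy to check by hand.

```python
import pandas as pd

# Made-up prices, for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

ma_3 = prices.rolling(window=3).mean()      # first two entries are NaN
daily_returns = prices.pct_change() * 100   # first entry is NaN

toy = pd.DataFrame({"price": prices, "ma_3": ma_3, "daily_returns": daily_returns})
clean = toy.dropna()  # drops the rows where either feature is undefined
```

Here the third moving-average value is (100 + 102 + 101) / 3 = 101, the second daily return is 2%, and dropna() leaves three of the five rows.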




## Visualization

In [None]:
df = prepare_data()
df.info()
print(df.head())

plot_data = df.loc['2023-01-01':'2023-12-31']

# Plotting stock price along with moving averages (Plotly creates its own figure)
fig = px.line(
    data_frame=plot_data, 
    x=plot_data.index, 
    y=['Adj Close', 'ma_30', 'ma_90']
)
fig.show()

# Plotting daily returns
fig = px.line(
    data_frame=df, 
    x=df.index, 
    y=['daily_returns']
)
fig.show()


After preprocessing the data, the code creates two visualisations:
- The first chart shows the stock's adjusted closing price for 2023 together with its 30-day and 90-day moving averages, which makes the trend and fluctuation of the price over that year visible.
- The second chart shows the stock's daily returns over the full period, allowing analysts to spot times of significant volatility or unexpected price moves.

## GRU Model Preparation and Training:

In [None]:
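# Imports needed by the GRU functions below
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

# Function to prepare data for GRU model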
def prepare_gru_data():
    data = prepare_data()  
    dataset = data[['Adj Close']].values.astype('float32')
    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(dataset)

    return dataset

# Function to build GRU model
def build_gru_model(input_shape):
    model = Sequential()
    model.add(GRU(units=50, input_shape=input_shape))
    model.add(Dense(units=1))
    model.compile(optimizer='adam', loss='mean_squared_error')

    return model

# Function to train GRU model
def train_gru_model(dataset):
    train_size = int(len(dataset) * 0.67)
    test_size = len(dataset) - train_size
    train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

    def create_dataset(dataset, look_back=1):
        X, Y = [], []
        for i in range(len(dataset) - look_back):  # include the last available pair
            a = dataset[i:(i+look_back), 0]
            X.append(a)
            Y.append(dataset[i + look_back, 0])

        return np.array(X), np.array(Y)

    look_back = 1
    trainX, trainY = create_dataset(train, look_back)
    testX, testY = create_dataset(test, look_back)

    trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

    model = build_gru_model((1, look_back))
    model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

    return model, testX, testY

# Function to make predictions using GRU model
def predict_gru_model(model, testX):
    return model.predict(testX)

# Preparing data
gru_dataset = prepare_gru_data()

# Training 
gru_model, testX, testY = train_gru_model(gru_dataset)

# predictions
gru_predictions = predict_gru_model(gru_model, testX)

The code then prepares the data for training a Gated Recurrent Unit (GRU) model. The prepare_gru_data function takes the 'Adj Close' column from the prepared DataFrame and scales it with MinMaxScaler, transforming the values into the range 0 to 1. Scaling keeps the inputs in a small, consistent range, which generally helps neural networks train more stably. Note that the scaler is fitted on the full series before the train/test split, so the test data influences the scaling.
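
The scaling step can be written out by hand to show what it does. The sketch below is a NumPy illustration rather than the notebook's code. It assumes the scaling parameters are taken from the training portion only, which is one way to keep test data out of the fit, and the helper names scale and unscale are hypothetical.

```python
import numpy as np

# A few closing prices from the raw table, as a column vector
prices = np.array([126.0, 140.65, 132.9, 131.2, 141.55], dtype="float32").reshape(-1, 1)

# Assumption of this sketch: fit the min and max on the training part only
train = prices[:3]
lo, hi = train.min(), train.max()

def scale(x):
    """Map values so that the training minimum is 0 and the maximum is 1."""
    return (x - lo) / (hi - lo)

def unscale(x_scaled):
    """Invert scale(), returning values in the original price units."""
    return x_scaled * (hi - lo) + lo

scaled = scale(prices)
restored = unscale(scaled)
```

Values outside the training range, such as the last price here, scale to slightly above 1. The inverse mapping is what allows predictions to be reported in price units. scikit-learn's MinMaxScaler offers inverse_transform for the same purpose.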


Next, the code defines functions for building, training, and using the GRU model.
The build_gru_model function creates a Sequential neural network model with TensorFlow's Keras API. It consists of a GRU layer with 50 units followed by a Dense layer with one unit. The model is compiled with the Adam optimizer and mean squared error loss, both of which are common choices for training regression models.
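
To give a sense of the model's size, the arithmetic below estimates its number of trainable parameters. It assumes one input feature per time step (look_back = 1) and the GRU variant with two bias vectors per gate (reset_after=True, the default in TensorFlow 2's Keras). The helper functions are hypothetical and written only for this calculation.

```python
def gru_param_count(units, input_dim, reset_after=True):
    """Trainable parameters in one GRU layer (update gate, reset gate, candidate)."""
    bias_sets = 2 if reset_after else 1
    return 3 * (units * input_dim + units * units + bias_sets * units)

def dense_param_count(inputs, units):
    """Trainable parameters in a dense layer: weights plus biases."""
    return inputs * units + units

gru_params = gru_param_count(units=50, input_dim=1)    # 3 * (50 + 2500 + 100)
dense_params = dense_param_count(inputs=50, units=1)   # 50 weights + 1 bias
total_params = gru_params + dense_params
```

Under these assumptions the model has 7,950 GRU parameters and 51 dense parameters, 8,001 in total, which can be compared with the output of model.summary().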

The train_gru_model function splits the dataset into training (67%) and testing (33%) sets, generates input-output pairs by shifting the time series by a look-back window (here, one step), reshapes the inputs into the three-dimensional form the GRU layer expects, builds the model, and trains it on the training data. The model is trained for 100 epochs (full passes over the training data) with a batch size of one.
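
The windowing step is easier to follow on a small array. The sketch below uses a hypothetical helper, make_windows, with a look-back of 3 on a ten-value stand-in series. Unlike the notebook, which passes the window as features of a single time step, it arranges each window as 3 time steps of 1 feature, an alternative layout shown only for illustration.

```python
import numpy as np

def make_windows(series, look_back):
    """Turn a 1-D series into (X, y) pairs: look_back past values -> next value."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

series = np.arange(10, dtype="float32")   # stand-in for scaled prices
split = int(len(series) * 0.67)           # chronological split, no shuffling
train, test = series[:split], series[split:]

X_train, y_train = make_windows(train, look_back=3)
# Shape expected by a recurrent layer: (samples, time steps, features)
X_train = X_train.reshape(X_train.shape[0], 3, 1)
```

With six training values and a look-back of 3, this yields three samples: [0, 1, 2] predicts 3, [1, 2, 3] predicts 4, and [2, 3, 4] predicts 5.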

The predict_gru_model function uses the trained GRU model and input data to make predictions for the test set.


## Model Assessment:

Finally, the code assesses the performance of the GRU model and displays the results:

It computes the Root Mean Squared Error (RMSE) between the actual and predicted values. The RMSE measures the model's prediction accuracy: it is the square root of the average squared error, so it reflects the typical size of the errors in the same units as the data.
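
A minimal sketch of the RMSE calculation is given below. The rmse helper and the toy numbers are assumptions for illustration. In the notebook, the inputs would be testY and gru_predictions.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays."""
    # Flatten both inputs: Keras predictions have shape (n, 1), targets (n,)
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values standing in for testY and gru_predictions
actual = [0.50, 0.55, 0.60, 0.58]
predicted = [[0.48], [0.56], [0.63], [0.58]]
score = rmse(actual, predicted)
```

Flattening matters because subtracting an (n, 1) array from an (n,) array would broadcast to an (n, n) matrix and silently give the wrong error. If the values are still scaled to the 0 to 1 range, the RMSE is in scaled units. Inverting the scaling first gives an error in price units.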

The model's performance is also assessed visually by plotting the actual and predicted values together. This lets analysts judge how well the model follows the data's underlying trends and whether it can forecast future stock prices accurately.
