In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = 50
sns.set_style('darkgrid')

# What is NIFTY 50

![](https://imgur.com/fEgI9b6.png)


The NIFTY 50 is the [National Stock Exchange of India's](https://en.wikipedia.org/wiki/National_Stock_Exchange_of_India) benchmark broad-based stock market index for the Indian equity market. NIFTY 50 stands for National Index Fifty and represents the weighted average of 50 Indian company stocks across 17 sectors. It is one of the two main stock indices used in India, the other being the BSE Sensex.


![](https://i2.wp.com/stableinvestor.com/wp-content/uploads/2019/09/Nifty-Indexes-Broad-Markets.png?w=630&ssl=1)

The dataset consists of 13 files. Let's quickly understand what they are:

## INDIAVIX 

India VIX is a volatility index based on NIFTY index option prices. A volatility index measures the market's expectation of volatility over the near term, that is, the amount by which the underlying index is expected to fluctuate. Volatility is often described as the "rate and magnitude of changes in prices" and, in finance, is often referred to as risk.

## NIFTY 50, NIFTY 100 and NIFTY 500 
  * NIFTY 500 - Represents the top 500 companies by full market capitalisation from the eligible universe.
  * NIFTY 100 - Represents the top 100 companies (ranked 1 to 100) within the NIFTY 500. This index tracks the performance of companies with large market caps.
  * NIFTY 50 - Represents the top 50 companies within the NIFTY 100.

## NIFTY SMALL CAP & MID CAP
* NIFTY SMALLCAP - This index measures the performance of small-cap companies.
* NIFTY MIDCAP - This index tries to measure the performance of mid-cap companies.

## NIFTY NEXT 50
This index includes the 50 companies that remain in the NIFTY 100 after excluding the NIFTY 50 companies. It is also called NIFTY Junior.

## NIFTY SECTORAL INDICES
This group includes NIFTY AUTO, NIFTY BANK, NIFTY FMCG, NIFTY IT, NIFTY METAL and NIFTY PHARMA.
These indices are designed to reflect the behavior and performance of the segment they represent, i.e. automobiles, banks, pharma, etc.



In [None]:
nifty_50 = pd.read_csv('../input/nifty-indices-dataset/NIFTY 50.csv',parse_dates=["Date"], index_col="Date")
nifty_50.head()

## About the Stock Data

Now that our data has been converted into the desired format, let’s take a look at its various columns for further analysis.

* **The Open and Close columns** indicate the opening and closing price of the stocks on a particular day.
* **The High and Low columns** provide the highest and the lowest price for the stock on a particular day, respectively.
* **The Volume column** tells us the total volume of stocks traded on a particular day.
* **The Turnover column** refers to the total value of stocks traded during a specific period of time. The period may be a year, a quarter, a month or a day.
* **P/E**, also called the price-earnings ratio, relates a company's share price to its earnings per share.
* **P/B**, also called the price-to-book ratio, measures the market's valuation of a company relative to its book value.
* **Div Yield**, or the dividend yield, is the amount of money a company pays shareholders over the course of a year for owning a share of its stock, divided by its current stock price and displayed as a percentage.

**Missing Values**

In [None]:
nifty_50.isnull().sum()

In [None]:
nifty_50.interpolate(method='time', inplace=True)

# Analysing the NIFTY 50 data

In [None]:
ax = nifty_50[['High', 'Low']].plot(figsize=(20, 6))
ax.set_title('High v/s Low', fontsize=24);

In [None]:
ax = nifty_50[['Close']].plot(figsize=(20, 6))
ax.set_title('Closing Prices', fontsize=24);

## P/E v/s P/B Ratio : Which one to use?

[P/E ratio is a popular measure](https://towardsdatascience.com/visualizing-the-stock-market-with-tableau-c0a7288e7b4d) of how expensive a company's stock is. It is simply the company's market capitalization divided by its net income, in other words, how much it costs us to buy $1 of a particular company's earnings. The higher the P/E ratio, all other things equal, the more expensive a stock is perceived to be. The P/E ratio shows what the market is willing to pay today for a stock based on its past or future earnings. A high P/E could mean that a stock's price is high relative to earnings and possibly overvalued. Conversely, a low P/E might indicate that the current stock price is low relative to earnings.

![](https://imgur.com/mNCjWPD.png)

The **P/B ratio**, on the other hand, measures the market's valuation of a company relative to its book value. Value investors use it to identify potential investments, and a P/B ratio under 1 is often read as a sign that a stock may be undervalued.
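
To make the three valuation columns concrete, here is a small worked example for a single hypothetical company. Every figure below is invented for illustration and is not taken from the dataset.

```python
# Hypothetical per-share figures (illustrative only, not from the dataset)
price = 200.0                 # current share price
earnings_per_share = 10.0     # net income divided by shares outstanding
book_value_per_share = 250.0  # book value divided by shares outstanding
dividend_per_share = 4.0      # dividends paid over one year

pe_ratio = price / earnings_per_share
pb_ratio = price / book_value_per_share
dividend_yield_pct = dividend_per_share / price * 100

print(f"P/E: {pe_ratio:.1f}")
print(f"P/B: {pb_ratio:.2f}")
print(f"Dividend yield: {dividend_yield_pct:.1f}%")
```

In this made-up case the stock trades at 20 times its earnings but below its book value, which is the kind of situation where the two ratios tell different stories.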

![](https://imgur.com/uFGqIRV.png)

In [None]:
ax = nifty_50[['P/E', 'P/B']].plot(figsize=(20, 6))
ax.set_title('P/E v/s P/B', fontsize=24);

## Market Performance 2019 Onwards

In [None]:
ax = nifty_50[['Close']]['2019':].plot(figsize=(20, 6))
ax.set_title('Closing Prices', fontsize=24);

## Performance of other Nifty Sectoral Indices in 2020

Let us now see the performance of NIFTY's sectoral indices which have been provided in the data. It'll be interesting to see how they have fared in these times of turmoil.

Let's quickly understand what each of them represents:

* **NIFTY Auto Index**: The Nifty Auto Index is designed to reflect the behavior and performance of the automobiles sector, which includes manufacturers of cars and motorcycles, heavy vehicles, auto ancillaries, tyres, etc.

* **NIFTY Bank Index**: The Nifty Bank Index comprises the most liquid and large-capitalised Indian banking stocks. It provides investors and market intermediaries with a benchmark that captures the capital market performance of Indian banks.

* **NIFTY FMCG Index**: The Nifty FMCG Index comprises a maximum of 15 companies that manufacture FMCG (Fast Moving Consumer Goods) products.

* **NIFTY IT Index**: Companies in this index are those that have more than 50% of their turnover from IT-related activities such as IT infrastructure, IT education and software training, telecommunication services and networking infrastructure, software development, hardware manufacturing, vending, support and maintenance.

* **NIFTY Metal Index**: The Nifty Metal Index is designed to reflect the behavior and performance of the metals sector, including mining. It comprises a maximum of 15 stocks that are listed on the National Stock Exchange.

* **NIFTY Pharma Index**: The Nifty Pharma Index is designed to capture the performance of companies in the pharmaceuticals sector.

In [None]:
nifty_auto = pd.read_csv('../input/nifty-indices-dataset/NIFTY AUTO.csv',parse_dates=["Date"], index_col="Date")
nifty_bank = pd.read_csv('../input/nifty-indices-dataset/NIFTY BANK.csv',parse_dates=["Date"], index_col="Date")
nifty_fmcg = pd.read_csv('../input/nifty-indices-dataset/NIFTY FMCG.csv',parse_dates=["Date"], index_col="Date")
nifty_IT = pd.read_csv('../input/nifty-indices-dataset/NIFTY IT.csv',parse_dates=["Date"], index_col="Date")
nifty_metal = pd.read_csv('../input/nifty-indices-dataset/NIFTY METAL.csv',parse_dates=["Date"], index_col="Date")
nifty_pharma = pd.read_csv('../input/nifty-indices-dataset/NIFTY PHARMA.csv',parse_dates=["Date"], index_col="Date")


nifty_auto.interpolate(method='time', inplace=True)
nifty_bank.interpolate(method='time', inplace=True)
nifty_fmcg.interpolate(method='time', inplace=True)
nifty_IT.interpolate(method='time', inplace=True)
nifty_metal.interpolate(method='time', inplace=True)
nifty_pharma.interpolate(method='time', inplace=True)

df = pd.DataFrame({
    'NIFTY Auto index': nifty_auto['Close']['2020':].values, 
    'NIFTY Bank index': nifty_bank['Close']['2020':].values,
    'NIFTY FMCG index': nifty_fmcg['Close']['2020':].values,
    'NIFTY IT index': nifty_IT['Close']['2020':].values,
    'NIFTY Metal index': nifty_metal['Close']['2020':].values,
    'NIFTY Pharma index': nifty_pharma['Close']['2020':].values,
})

In [None]:
ax = df.plot.box(figsize=(20, 6))

In [None]:
ax = df.plot(subplots=True, figsize=(20, 12))
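
Because the sectoral indices trade at very different levels, their raw closing values are hard to compare on a single axis. One common option, sketched below on made-up numbers, is to rebase every series to 100 at its first observation so that a chart shows relative performance. The frame `closes` and its values are hypothetical stand-ins for the `df` built above.

```python
import pandas as pd

# Hypothetical closing levels for three indices (not real data)
closes = pd.DataFrame(
    {
        "Index A": [8000.0, 7600.0, 8400.0],
        "Index B": [30000.0, 27000.0, 24000.0],
        "Index C": [12000.0, 12600.0, 13200.0],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)

# Divide each column by its own first value and scale to 100
rebased = closes / closes.iloc[0] * 100
print(rebased.round(1))
```

After rebasing, a value of 80 simply means the index is 20% below where it started, whatever its absolute level.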

# Time Series Analysis

## What is Time Series Data
Time series data is a sequence of data points in chronological order that is used by businesses to analyze past data and make future predictions. These data points are a set of observations at specified times and equal intervals, typically with a datetime index and corresponding value. Common examples of time series data in our day-to-day lives include:     

* Measuring weather temperatures 
* Measuring the number of taxi rides per month
* Predicting a company’s stock prices for the next day


## Components of Time Series

Time series data consists of four components:

1. Trend component: variation that moves up or down in a reasonably predictable pattern over a long period.

2. Seasonality component: variation that is regular and periodic and repeats over a specific period such as a day, week, month or season.

3. Cyclical component: variation that corresponds to business or economic 'boom-bust' cycles, or that follows its own peculiar cycle.

4. Random component: variation that is erratic or residual and does not fall under any of the above three classifications.

<img src='https://kite.com/wp-content/uploads/2019/08/variations-of-time-series.jpg'>
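
The four components can be made tangible by building a synthetic series out of them. The sketch below is purely illustrative: the amplitudes, the weekly period and the 180-day cycle are arbitrary assumptions, not properties of the stock data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=365, freq="D")
t = np.arange(len(dates))

trend = 100 + 0.1 * t                         # slow upward drift
seasonal = 5 * np.sin(2 * np.pi * t / 7)      # pattern repeating every 7 days
cyclical = 10 * np.sin(2 * np.pi * t / 180)   # slower, longer swing
noise = rng.normal(0, 1, len(dates))          # irregular, random part

# An additive series is simply the sum of its components
series = pd.Series(trend + seasonal + cyclical + noise, index=dates, name="value")
print(series.head())
```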

## Dataset
In this section, we'll analyze the stock prices of RELIANCE. The columns we use are:

* The **Open and Close columns** indicate the opening and closing price of the stocks on a particular day.
* The **High and Low columns** provide the highest and the lowest price for the stock on a particular day, respectively.
* The **Volume column** tells us the total volume of stocks traded on a particular day.
* The **volume weighted average price (VWAP)** is a trading benchmark that gives the average price a security has traded at throughout the day, based on both volume and price. It is important because it gives traders insight into both the trend and the value of a security.

In [None]:
data = pd.read_csv("/kaggle/input/nifty50-stock-market-data/RELIANCE.csv", parse_dates=['Date'], index_col='Date', usecols=['Date', 'Open','High','Low','Close','Volume','VWAP'])
print(data.shape)
data.head()

In [None]:
data.tail()

In [None]:
ax = data['VWAP'].plot(figsize=(20,6))
ax.set_title('RELIANCE Stock Prices')
ax.axvspan('2014-06-01','2020-10-30', color='green', alpha=0.2) # Modi Govt
ax.axvspan('2020-03-01','2020-10-30', color='red', alpha=0.3) # Covid Pandemic
ax.set_ylabel('VWAP');

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sns.kdeplot(data['VWAP'], fill=True, ax=ax);  # `fill` replaces the deprecated `shade` argument (seaborn 0.11+)

## Resampling and Missing Treatment

In [None]:
data = data.resample('D').mean()
data.isnull().sum()

In [None]:
data.interpolate(method='time', inplace=True)
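
The two cells above first resample to calendar days, which creates empty rows for weekends and holidays, and then fill those rows by interpolating in time. A minimal sketch of that behaviour on two made-up prices around a weekend:

```python
import pandas as pd

# Two trading days around a weekend (hypothetical prices)
prices = pd.Series(
    [10.0, 16.0],
    index=pd.to_datetime(["2020-01-03", "2020-01-06"]),  # a Friday and the next Monday
)

daily = prices.resample("D").mean()         # Saturday and Sunday appear as NaN
filled = daily.interpolate(method="time")   # fill linearly in elapsed time

print(daily)
print(filled)
```

Keep in mind that the filled weekend values are not real trades; they are straight-line estimates between the surrounding trading days.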

## Seasonal decomposition

We can decompose a time series into trend, seasonal and remainder components. The series can be modelled as either an additive or a multiplicative combination of the base level, trend, seasonal index and residual.

The `seasonal_decompose` function in statsmodels implements this decomposition.
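
To show what an additive decomposition does, here is a simplified hand-rolled version using only pandas. It is a sketch of the classical moving-average idea on a noise-free toy series, not a reproduction of the exact statsmodels algorithm; the period of 7 and the seasonal pattern are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

period = 7
t = np.arange(period * 12)
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])  # sums to zero
y = pd.Series(50 + 0.5 * t + np.tile(pattern, 12))          # trend + seasonal, no noise

# 1. Trend: centered moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# 2. Seasonal: average detrended value at each position in the cycle
detrended = y - trend
seasonal_by_phase = detrended.groupby(t % period).mean()
seasonal = pd.Series(seasonal_by_phase.loc[t % period].to_numpy(), index=y.index)

# 3. Remainder: whatever trend and seasonal do not explain
remainder = y - trend - seasonal

print(seasonal_by_phase.round(2))
```

A multiplicative decomposition follows the same steps with division in place of subtraction, which is why it requires strictly positive values.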

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

plt.rcParams.update({'figure.figsize': (20,10)})
y = data['VWAP'].to_frame()


# Multiplicative Decomposition 
seasonal_decompose(y, model='multiplicative',period = 52).plot().suptitle('Multiplicative Decompose', fontsize=22)

# Additive Decomposition
seasonal_decompose(y, model='additive',period = 52).plot().suptitle('Additive Decompose', fontsize=22);

## Stationarity

In the most intuitive sense, stationarity means that the statistical properties of the process generating a time series do not change over time. It does not mean that the series itself does not change, just that the way it changes stays the same. A rough algebraic analogy is a linear function rather than a constant one: the value of a linear function changes as 𝒙 grows, but the way it changes remains constant, captured by a single slope. The analogy is loose, because a series that follows a linear trend is not itself stationary (its mean keeps moving); in practice we look for a constant mean, a constant variance and an autocorrelation structure that does not depend on time.
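
A quick way to build intuition is to compare the mean of the first and second half of a series. The sketch below uses a simulated series with a linear trend (an assumption for illustration, not the Reliance data): the raw series has a drifting mean, while its first difference does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = np.arange(200)
y = pd.Series(2.0 * t + rng.normal(0, 1, len(t)))  # linear trend plus noise

def half_means(s):
    """Return the mean of the first and the second half of a series."""
    s = s.dropna()
    mid = len(s) // 2
    return s.iloc[:mid].mean(), s.iloc[mid:].mean()

raw_first, raw_second = half_means(y)
diff_first, diff_second = half_means(y.diff())  # day-over-day changes

print(f"raw series:  {raw_first:.1f} vs {raw_second:.1f}")
print(f"differenced: {diff_first:.2f} vs {diff_second:.2f}")
```

This is only an informal check; a formal test such as the Augmented Dickey-Fuller test is what decides whether a unit root is present.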

In [None]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string(), '\n')          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")
        
    return out

adf_test(data['VWAP'],title='Reliance Stock Data');

In [None]:
ax = data['VWAP'].resample('M').mean().plot.line(figsize=(20, 6))
ax.axvspan('2014-06','2020-10', color='green', alpha=0.2) # Modi Govt
ax.axvspan('2020-03','2020-10', color='red', alpha=0.3) # Covid Pandemic
ax.set_title('Monthly Mean VWAP for Reliance');

In [None]:
ax = data['VWAP'].resample('A').mean().plot.bar(figsize=(20, 6))
ax.set_title('Yearly Mean VWAP for Reliance');

# Plotting ACF and PACF

**Autocorrelation** and **partial autocorrelation** plots are heavily used in time series analysis and forecasting.

These plots graphically summarize the strength of the relationship between an observation in a time series and observations at prior time steps.

**Statistical correlation** summarizes the strength of the relationship between two variables.

We can calculate the correlation between time series observations and observations at previous time steps, called lags. Because the correlation is calculated with values of the same series at previous times, this is called a **serial correlation, or an autocorrelation.**

A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or the acronym ACF. This plot is sometimes called a **correlogram or an autocorrelation plot**.

A **partial autocorrelation** is a summary of the relationship between an observation in a time series and observations at prior time steps, with the relationships of intervening observations removed.

The autocorrelation between an observation and an observation at a prior time step consists of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation with observations at intervening time steps.

It is these indirect correlations that the partial autocorrelation function seeks to remove. Without going into the math, this is the intuition for the partial autocorrelation.
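
The difference between the two can be seen on a simulated AR(1) series, where each value depends directly only on the previous one. The sketch below is illustrative: the coefficient 0.8 and the series length are arbitrary assumptions, and the partial autocorrelation is computed with a plain least-squares regression rather than a library routine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, phi = 2000, 0.8

# Simulate an AR(1) process: x[t] = phi * x[t-1] + noise
x = np.zeros(n)
noise = rng.normal(0, 1, n)
for i in range(1, n):
    x[i] = phi * x[i - 1] + noise[i]
s = pd.Series(x)

# Autocorrelation: plain correlation between the series and its lagged copy
acf_lag1 = s.autocorr(lag=1)
acf_lag2 = s.autocorr(lag=2)

# Partial autocorrelation at lag 2: coefficient on x[t-2] when x[t] is
# regressed on both x[t-1] and x[t-2], so the lag-1 effect is held fixed
X = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])
coef, *_ = np.linalg.lstsq(X, x[2:], rcond=None)
pacf_lag2 = coef[2]

print(f"ACF  lag 1: {acf_lag1:.2f}, lag 2: {acf_lag2:.2f}")
print(f"PACF lag 2: {pacf_lag2:.2f}")
```

The lag-2 autocorrelation is clearly non-zero because it is inherited through lag 1, while the lag-2 partial autocorrelation is close to zero once that indirect path is removed.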


In [None]:
import statsmodels.api as sm

In [None]:
plt.rcParams.update({'figure.figsize': (20,6)})

sm.graphics.tsa.plot_acf(data['VWAP'], lags=30,title='auto correlation of VWAP',zero=False);
sm.graphics.tsa.plot_pacf(data['VWAP'], lags=30,title='partial auto correlation of VWAP',zero=False);

# Feature Engineering
Almost every time series problem will have some external features or some internal feature engineering to help the model.

Let's add some basic features that are widely used for time series problems, namely statistics of the lagged values of the available numeric columns. Since we need to predict the price of the stock for a given day, we cannot use that day's own feature values, because they will be unavailable at inference time. Instead we use statistics such as the mean and standard deviation of earlier values, shifted by one day.

We will use three rolling windows: one looking back 3 days, one looking back 7 days and one looking back 30 days, with the longer two serving as proxies for last-week and last-month metrics.
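
The key step is the one-row shift applied after the rolling statistic, which keeps a day's own value out of its feature. A toy illustration with made-up prices (`min_periods=1` is used here for simplicity):

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="Close")  # toy prices

rolling_mean = close.rolling(window=3, min_periods=1).mean()
lagged_mean = rolling_mean.shift(1)  # row t now only sees rows before t

print(pd.DataFrame({
    "Close": close,
    "mean_3": rolling_mean,
    "mean_3_lagged": lagged_mean,
}))
```

Without the shift, the feature for the last row would include that row's own price of 5.0; with it, the feature is the mean of 2.0, 3.0 and 4.0.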

In [None]:
data = data.reset_index()
lag_features = ["Open", "High", "Low", "Close", "VWAP", "Volume"]
window1 = 3
window2 = 7
window3 = 30

df_rolled_3d = data[lag_features].rolling(window=window1, min_periods=0)
df_rolled_7d = data[lag_features].rolling(window=window2, min_periods=0)
df_rolled_30d = data[lag_features].rolling(window=window3, min_periods=0)

df_mean_3d = df_rolled_3d.mean().shift(1).reset_index().astype(np.float32)
df_mean_7d = df_rolled_7d.mean().shift(1).reset_index().astype(np.float32)
df_mean_30d = df_rolled_30d.mean().shift(1).reset_index().astype(np.float32)

df_std_3d = df_rolled_3d.std().shift(1).reset_index().astype(np.float32)
df_std_7d = df_rolled_7d.std().shift(1).reset_index().astype(np.float32)
df_std_30d = df_rolled_30d.std().shift(1).reset_index().astype(np.float32)

for feature in lag_features:
    data[f"{feature}_mean_lag{window1}"] = df_mean_3d[feature]
    data[f"{feature}_mean_lag{window2}"] = df_mean_7d[feature]
    data[f"{feature}_mean_lag{window3}"] = df_mean_30d[feature]
    
    data[f"{feature}_std_lag{window1}"] = df_std_3d[feature]
    data[f"{feature}_std_lag{window2}"] = df_std_7d[feature]
    data[f"{feature}_std_lag{window3}"] = df_std_30d[feature]

data.set_index("Date", drop=False, inplace=True)
data.interpolate(method='time', inplace=True)
data.fillna(data.mean(), inplace=True)
data.head()

For boosting models, it is very useful to add datetime features such as hour, day and month, as applicable, to give the model information about the time component in the data. For classical time series models it is not necessary to pass this information explicitly.

In [None]:
data.Date = pd.to_datetime(data.Date, format="%Y-%m-%d")
data["month"] = data.Date.dt.month
data["week"] = data.Date.dt.isocalendar().week.astype("int64")  # .dt.week was removed in pandas 2.0
data["day"] = data.Date.dt.day
data["day_of_week"] = data.Date.dt.dayofweek
data.head()

In [None]:
exogenous_features = data.columns[7:]

In [None]:
df_train = data.loc[:"2018"].copy()
df_valid = data.loc["2019"].copy()  # copy so prediction columns can be added without SettingWithCopyWarning

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Dummy Model

In [None]:
from sklearn.dummy import DummyRegressor

model = DummyRegressor().fit(df_train[exogenous_features], df_train['VWAP'])
df_valid['Dummy_preds'] = model.predict(df_valid[exogenous_features])
print('RMSE:', np.sqrt(mean_squared_error(df_valid['VWAP'], df_valid['Dummy_preds'])))
print('MAE:', mean_absolute_error(df_valid['VWAP'], df_valid['Dummy_preds']))
df_valid[['VWAP', 'Dummy_preds']].plot();

# LightGBM

In [None]:
from lightgbm import LGBMRegressor

model = LGBMRegressor().fit(df_train[exogenous_features], df_train['VWAP'])
df_valid['LGBM_preds'] = model.predict(df_valid[exogenous_features])
print('RMSE:', np.sqrt(mean_squared_error(df_valid['VWAP'], df_valid['LGBM_preds'])))
print('MAE:', mean_absolute_error(df_valid['VWAP'], df_valid['LGBM_preds']))
df_valid[['VWAP', 'LGBM_preds']].plot();

# ARIMAX

In [None]:
!pip install pmdarima
from pmdarima import auto_arima

In [None]:
%%time

# pmdarima 1.8+ names the exogenous regressors `X`; older releases used `exogenous=`
model = auto_arima(
    df_train.VWAP, X=df_train[exogenous_features],
    trace=True, error_action="ignore", suppress_warnings=True
).fit(df_train.VWAP, X=df_train[exogenous_features])

df_valid["ARIMAX_preds"] = model.predict(n_periods=len(df_valid), X=df_valid[exogenous_features])

print('RMSE:', np.sqrt(mean_squared_error(df_valid['VWAP'], df_valid['ARIMAX_preds'])))
print('MAE:', mean_absolute_error(df_valid['VWAP'], df_valid['ARIMAX_preds']))
df_valid[['VWAP', 'ARIMAX_preds']].plot();

In [None]:
model.summary()

## Error Analysis

In [None]:
df_valid['ARIMAX_error'] = df_valid['VWAP'] - df_valid['ARIMAX_preds']
df_valid['LGBM_error'] = df_valid['VWAP'] - df_valid['LGBM_preds']
ax = df_valid[['LGBM_error', 'ARIMAX_error']].plot()
ax.set_title('Errors', fontsize=18);

In [None]:
fig, ax = plt.subplots()
ax.plot(df_valid['VWAP'].values, df_valid['LGBM_error'].values, '.', label='LGBM_error')
ax.plot(df_valid['VWAP'].values, df_valid['ARIMAX_error'].values, '.', label='ARIMAX_error')
ax.set_title('Residual Plots', fontsize=18)
ax.legend();

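The residuals show no obvious pattern against the actual VWAP, which suggests that neither model is leaving systematic structure unexplained. This is a necessary check rather than proof of accuracy; the RMSE and MAE reported above, compared against the dummy baseline, are the measure of how well each model performs.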
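
A visual reading of a residual plot can be backed up with a few numbers. The sketch below runs on simulated residuals, not on the errors produced in this notebook, and shows two simple checks: the average error (bias) and the lag-1 autocorrelation of the errors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
residuals = pd.Series(rng.normal(0, 1, 500))  # simulated, pattern-free errors

bias = residuals.mean()                 # near 0 when predictions are unbiased
lag1_autocorr = residuals.autocorr(1)   # near 0 when errors are not serially correlated
rmse = np.sqrt((residuals ** 2).mean())
mae = residuals.abs().mean()

print(f"bias={bias:.3f}, lag-1 autocorr={lag1_autocorr:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}")
```

The same calls could be applied to `df_valid['LGBM_error']` and `df_valid['ARIMAX_error']`; a bias far from zero or a large lag-1 autocorrelation would indicate structure the model has missed.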

**References**

* [Nifty Data EDA](https://www.kaggle.com/parulpandey/nifty-data-eda)
* [Getting started with Time Series using Pandas](https://www.kaggle.com/parulpandey/getting-started-with-time-series-using-pandas)
* [A modern Time Series tutorial](https://www.kaggle.com/rohanrao/a-modern-time-series-tutorial)
* [Time Series Analysis and Forecasting - Reliance](https://www.kaggle.com/yashvi/time-series-analysis-and-forecasting-reliance)