# Stock Price Forecasting Using Historical Data and Financial Indicators

## 1. Project Overview

### Introduction
This project aims to forecast future stock prices using historical data and various machine learning techniques. The process involves data preprocessing, feature engineering, model building, and evaluation.

### Data Preprocessing
The dataset includes historical stock prices with columns for Date, Open, High, Low, Close, Adj Close, and Volume. The Date column was converted to datetime format and set as the index, and date-based features (year, month, day, day of week, and month start/end flags) were created.

### Exploratory Data Analysis (EDA)
EDA techniques such as line charts, candlestick charts, and correlation analysis were used to understand the data's underlying structure and relationships.

### Feature Engineering
Features such as lagged prices, technical indicators, and date-based features were engineered to improve the model's predictive performance.

### Model Building and Evaluation
A RandomForestRegressor was used to predict future stock prices. The model was fine-tuned using GridSearchCV, and its performance was evaluated using metrics like MSE, MAE, and R-Squared.
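
As a rough illustration of the tuning and evaluation step described above, the following sketch fits a RandomForestRegressor with GridSearchCV and reports MSE, MAE, and R-Squared. The synthetic random-walk series, the two lag features, the parameter grid, and the use of `TimeSeriesSplit` for cross-validation are all assumptions made for illustration, not the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic random-walk "prices" standing in for the real dataset (assumption).
rng = np.random.default_rng(42)
close = pd.Series(50 + rng.normal(0, 1, 300).cumsum(), name="Close")

# Predict each close from the two previous closes.
data = pd.DataFrame(
    {"Lag_1": close.shift(1), "Lag_2": close.shift(2), "Close": close}
).dropna()
X, y = data[["Lag_1", "Lag_2"]], data["Close"]

# Chronological split: the last 20% of rows form the test set.
split = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Small illustrative grid; TimeSeriesSplit keeps each validation fold
# later in time than the rows used to fit it.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out, most recent rows.
y_pred = search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(search.best_params_, f"MSE={mse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```

Using a time-ordered splitter inside the grid search is one way to keep future rows out of the validation folds; a plain k-fold split would mix past and future observations.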

### Forecasting
The model was used to predict future stock prices for the next 30 business days. Walk-forward validation was performed to simulate real-time prediction scenarios.
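
Walk-forward validation can be sketched as a loop that refits the model on everything observed so far and predicts only the next step. The example below is a minimal version under stated assumptions: synthetic data, a single `Lag_1` feature, an expanding training window, and a small forest to keep it quick. It also shows how a 30-business-day date range might be generated with `pd.bdate_range`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic business-day series standing in for the real closing prices (assumption).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=200)
close = pd.Series(50 + rng.normal(0, 1, len(dates)).cumsum(), index=dates, name="Close")
frame = pd.DataFrame({"Lag_1": close.shift(1), "Close": close}).dropna()

n_test = 30  # walk forward over the final 30 business days
predictions = []
for i in range(len(frame) - n_test, len(frame)):
    train = frame.iloc[:i]  # only rows observed before step i
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(train[["Lag_1"]], train["Close"])
    # Predict the single next row, then move the window forward by one day.
    predictions.append(model.predict(frame[["Lag_1"]].iloc[[i]])[0])

actual = frame["Close"].iloc[-n_test:]
walk_forward_mae = mean_absolute_error(actual, predictions)
print(f"Walk-forward MAE over {n_test} steps: {walk_forward_mae:.3f}")

# Dates for a 30-business-day horizon after the last observation.
future_dates = pd.bdate_range(start=close.index[-1] + pd.offsets.BDay(1), periods=30)
```

Refitting at every step is the simplest form of the idea; refitting less often or using a fixed-length rolling window are common variations.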


### Understanding the Data
The dataset includes the following columns:

- **Date**: Date of the stock data
- **Open**: Opening price
- **High**: Highest price of the day
- **Low**: Lowest price of the day
- **Close**: Closing price
- **Adj Close**: Adjusted closing price (accounting for corporate actions)
- **Volume**: Number of shares traded


The dataset contains 1,258 trading days, from September 3, 2019, to August 30, 2024.

Data source: https://finance.yahoo.com/quote/NDAQ/history/?guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAALB1rZy_4joT3mpo_wfv9r9Vd7iSZNLta4NzV0kttnLWT1eDGj7bGAF8EhCU1EAqQ2GAv5TmEocemroSqvY4ajSEp8hC1rNJs9v43gBU-jLP_SYXRiTkppugoPMYOh1foQDMBABm8hTBG1SvgLyf8k3S7r4oJfvUlIWx2Kyj81GL&guccounter=2&period1=1567241948&period2=1724976000

## 2. Imports

In [1]:
import pandas as pd


## 3. Data Loading

In [5]:
stock = pd.read_csv('./datasets/NDAQ.csv')
stock.head()

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|------|------|------|-----|-------|-----------|--------|
| 0 | 2019-09-03 | 33.279999 | 33.366669 | 33.006668 | 33.176666 | 30.771736 | 1091700 |
| 1 | 2019-09-04 | 33.34 | 33.389999 | 33.049999 | 33.303333 | 30.889227 | 1215000 |
| 2 | 2019-09-05 | 33.57 | 34.036667 | 33.413334 | 34.006668 | 31.54158 | 1826700 |
| 3 | 2019-09-06 | 34.116669 | 34.813332 | 34.049999 | 34.753334 | 32.234119 | 2254800 |
| 4 | 2019-09-09 | 34.93 | 35.086666 | 34.223331 | 34.279999 | 31.795088 | 1612200 |


## 4. Data Inspection

In [8]:
stock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       1258 non-null   object 
 1   Open       1258 non-null   float64
 2   High       1258 non-null   float64
 3   Low        1258 non-null   float64
 4   Close      1258 non-null   float64
 5   Adj Close  1258 non-null   float64
 6   Volume     1258 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 68.9+ KB


In [9]:
stock.describe()

|       | Open | High | Low | Close | Adj Close | Volume |
|-------|------|------|-----|-------|-----------|--------|
| count | 1258.0 | 1258.0 | 1258.0 | 1258.0 | 1258.0 | 1258.0 |
| mean | 52.404936 | 52.912419 | 51.88394 | 52.410776 | 50.740858 | 2570756.0 |
| std | 10.366754 | 10.406595 | 10.317112 | 10.363064 | 10.627781 | 1448224.0 |
| min | 25.636667 | 26.856667 | 23.886667 | 24.286667 | 22.84549 | 575400.0 |
| 25% | 43.530833 | 43.965834 | 43.160833 | 43.512499 | 41.164408 | 1776575.0 |
| 50% | 54.404999 | 54.935 | 54.024999 | 54.511665 | 53.01195 | 2231550.0 |
| 75% | 60.182499 | 60.69 | 59.665001 | 60.187499 | 59.108357 | 2916525.0 |
| max | 71.519997 | 72.139999 | 71.139999 | 72.080002 | 72.080002 | 18274300.0 |


**Observations**:

- No missing values.
- All price columns have similar ranges; Volume is several orders of magnitude larger, so scaling is required.
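
If scaling is applied, the scaler's statistics would normally be learned from the training rows only, so that later rows do not influence them. The sketch below illustrates this with scikit-learn's `StandardScaler` on the five rows printed by `stock.head()`; the 4/1 split is purely illustrative. Tree ensembles such as the RandomForestRegressor used in this project are largely insensitive to feature scale, so this step matters most if a scale-sensitive model is tried.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Close and Volume values copied from the first five rows shown by stock.head().
sample = pd.DataFrame({
    "Close": [33.176666, 33.303333, 34.006668, 34.753334, 34.279999],
    "Volume": [1091700, 1215000, 1826700, 2254800, 1612200],
})

# Illustrative chronological split: first four rows for training, last row held out.
train, test = sample.iloc[:4], sample.iloc[4:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on later rows

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```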

## 5. Exploratory Data Analysis

### Line Charts

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 7))
# Plot against the parsed Date column so the x-axis matches its 'Date' label
plt.plot(pd.to_datetime(stock['Date']), stock['Close'], label='Close Price', color='blue')
plt.title('Historical Stock Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()

### Candlestick Charts

In [None]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Candlestick(x=stock['Date'],
                                     open=stock['Open'],
                                     high=stock['High'],
                                     low=stock['Low'],
                                     close=stock['Close'])])
fig.update_layout(title='Candlestick Chart of Stock Prices',
                  xaxis_title='Date',
                  yaxis_title='Price')
fig.show()

### Scatter Plots

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(stock['Volume'], stock['Close'], alpha=0.5)
plt.title('Volume vs. Closing Price')
plt.xlabel('Volume')
plt.ylabel('Close Price')
plt.grid(True)
plt.show()

### Summary Statistics

In [None]:
stock[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']].describe()


### Skewness

In [None]:
from scipy.stats import skew

skewness = skew(stock['Close'])
print(f'Skewness of Closing Prices: {skewness}')


### Simple Moving Average (SMA)

In [None]:
stock['SMA_30'] = stock['Close'].rolling(window=30).mean()

plt.figure(figsize=(14, 7))
plt.plot(stock['Close'], label='Close Price')
plt.plot(stock['SMA_30'], label='30-Day SMA', color='orange')
plt.title('Stock Price and Simple Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()


### Exponential Moving Average (EMA)

In [None]:
stock['EMA_30'] = stock['Close'].ewm(span=30, adjust=False).mean()

plt.figure(figsize=(14, 7))
plt.plot(stock['Close'], label='Close Price')
plt.plot(stock['EMA_30'], label='30-Day EMA', color='red')
plt.title('Stock Price and Exponential Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()


### Daily Returns

In [None]:
stock['Daily_Return'] = stock['Close'].pct_change()

plt.figure(figsize=(10, 6))
plt.hist(stock['Daily_Return'].dropna(), bins=50, alpha=0.75)
plt.title('Distribution of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


### Correlation between Stocks or Indicators

In [None]:
correlation = stock[['Close', 'Volume']].corr()
print('Correlation Matrix:')
print(correlation)


### Outlier Detection

In [None]:
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.boxplot(x=stock['Close'])
plt.title('Boxplot of Closing Prices')
plt.xlabel('Close Price')
plt.show()


### Time Series Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

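# The default integer index carries no frequency, so the period must be passed explicitly.
# 252 (roughly one trading year) is an assumed value; adjust it to the seasonality of interest.
decomposition = seasonal_decompose(stock['Close'], model='additive', period=252)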
decomposition.plot()
plt.show()


### Lag Feature Creation

In [None]:
stock['Lag_1'] = stock['Close'].shift(1)
stock['Lag_2'] = stock['Close'].shift(2)
# Drops every row containing a NaN, including the warm-up rows of SMA_30 created earlier
stock = stock.dropna()


### Relative Strength Index (RSI)

In [None]:
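def calculate_rsi(data, period=14):
    """Return the RSI of a price series using simple rolling averages of gains and losses."""
    # Day-over-day price change
    delta = data.diff()
    # Average gain and average loss over the lookback window (losses as positive numbers)
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    # Relative strength: ratio of average gain to average loss
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi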

stock['RSI'] = calculate_rsi(stock['Close'])

plt.figure(figsize=(14, 7))
plt.plot(stock['RSI'], label='RSI', color='purple')
plt.axhline(70, linestyle='--', color='red', label='Overbought (70)')
plt.axhline(30, linestyle='--', color='green', label='Oversold (30)')
plt.title('Relative Strength Index (RSI)')
plt.xlabel('Date')
plt.ylabel('RSI')
plt.legend()
plt.grid(True)
plt.show()
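
The cell above averages gains and losses with a simple rolling mean. Wilder's original RSI instead smooths them exponentially with a factor of 1/period. For comparison, a minimal sketch of that variant follows; the function name `rsi_wilder` and the synthetic test series are assumptions made for illustration, not part of this notebook.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI using Wilder-style exponential smoothing (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)    # negative moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss         # inf when there were no losses, giving RSI = 100
    return 100 - 100 / (1 + rs)

# Synthetic random-walk prices (assumption) to exercise the function.
rng = np.random.default_rng(0)
close = pd.Series(50 + rng.normal(0, 1, 200).cumsum())
rsi = rsi_wilder(close)
print(rsi.dropna().describe())
```

The two variants usually track each other closely, but the exponentially smoothed version reacts less abruptly when a large move drops out of the lookback window.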


### Bollinger Bands

In [None]:
stock['SMA_20'] = stock['Close'].rolling(window=20).mean()
stock['Bollinger_Upper'] = stock['SMA_20'] + (stock['Close'].rolling(window=20).std() * 2)
stock['Bollinger_Lower'] = stock['SMA_20'] - (stock['Close'].rolling(window=20).std() * 2)

plt.figure(figsize=(14, 7))
plt.plot(stock['Close'], label='Close Price')
plt.plot(stock['SMA_20'], label='20-Day SMA', color='orange')
plt.fill_between(stock.index, stock['Bollinger_Lower'], stock['Bollinger_Upper'], color='gray', alpha=0.3)
plt.title('Bollinger Bands')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()


### MACD (Moving Average Convergence Divergence)

In [None]:
stock['EMA_12'] = stock['Close'].ewm(span=12, adjust=False).mean()
stock['EMA_26'] = stock['Close'].ewm(span=26, adjust=False).mean()
stock['MACD'] = stock['EMA_12'] - stock['EMA_26']
stock['MACD_Signal'] = stock['MACD'].ewm(span=9, adjust=False).mean()

plt.figure(figsize=(14, 7))
plt.plot(stock['MACD'], label='MACD', color='blue')
plt.plot(stock['MACD_Signal'], label='MACD Signal', color='red')
plt.title('MACD and Signal Line')
plt.xlabel('Date')
plt.ylabel('MACD')
plt.legend()
plt.grid(True)
plt.show()


### Risk Analysis

In [None]:
import numpy as np

# Calculate daily returns (the first row is NaN; it is dropped when computing VaR below)
stock['Daily_Return'] = stock['Close'].pct_change()

# Calculate VaR at 95% confidence level
VaR_95 = np.percentile(stock['Daily_Return'].dropna(), 5)
print(f'Value at Risk (95% confidence level): {VaR_95}')
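
The calculation above is the historical (non-parametric) form of VaR: the 5th percentile of observed daily returns is read as the loss that was exceeded on roughly 5% of days. A small worked example with made-up returns, and a hypothetical position size of 10,000, shows how the number is obtained and interpreted.

```python
import numpy as np

# Ten made-up daily returns, sorted for readability (assumption, not project data).
returns = np.array([-0.05, -0.02, -0.01, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04])

# Historical VaR at 95% confidence is the 5th percentile of the return distribution.
# With NumPy's default linear interpolation this lies between the two worst returns:
# -0.05 + 0.45 * (-0.02 - (-0.05)) = -0.0365
var_95 = np.percentile(returns, 5)

# Translate the return into a currency amount for a hypothetical position size.
position_value = 10_000
loss_at_var = -var_95 * position_value

print(f"VaR (95%): {var_95:.4f}, about {loss_at_var:.0f} on a {position_value} position")
```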


## 6. Data Preprocessing

In [None]:
# Convert the Date Column to datetime format
stock['Date'] = pd.to_datetime(stock['Date'])
stock.set_index('Date', inplace=True)


To enhance the model with features derived from the date...

In [None]:
# Create date-based columns
stock['Year'] = stock.index.year
stock['Month'] = stock.index.month
stock['Day'] = stock.index.day
stock['DayOfWeek'] = stock.index.dayofweek
stock['IsMonthStart'] = stock.index.is_month_start
stock['IsMonthEnd'] = stock.index.is_month_end


...and capture the effect of previous days' stock prices on the current price.

In [None]:
# Create lagged features
stock['Lag_1'] = stock['Close'].shift(1)
stock['Lag_2'] = stock['Close'].shift(2)
stock['Lag_5'] = stock['Close'].shift(5)
stock['Lag_10'] = stock['Close'].shift(10)
stock.dropna(inplace=True)

...then prepare for model building.

In [None]:
from sklearn.model_selection import train_test_split

features = stock.drop(['Close'], axis=1)
target = stock['Close']

# shuffle=False preserves chronological order: the final 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, shuffle=False)
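
One thing worth checking before training: a feature matrix built by dropping only `Close` still contains same-day columns such as Open, High, Low, and Adj Close, along with indicators computed from the current close. If the goal is to predict a close that has not happened yet, those values would not be available at prediction time. The sketch below, on synthetic data, shows one possible way to restrict the features to information known by the previous close. The column names `SMA_5_prev` and `Volume_prev` and the choice of features are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic business-day data standing in for the real dataset (assumption).
rng = np.random.default_rng(1)
dates = pd.bdate_range("2024-01-01", periods=120)
close = pd.Series(50 + rng.normal(0, 1, len(dates)).cumsum(), index=dates)
df = pd.DataFrame({
    "Close": close,
    "Volume": rng.integers(1_000_000, 3_000_000, len(dates)),
})

# Shift every market-derived predictor by one row so that row t only
# holds values that were already known at the close of t-1.
df["Lag_1"] = df["Close"].shift(1)
df["SMA_5_prev"] = df["Close"].rolling(5).mean().shift(1)
df["Volume_prev"] = df["Volume"].shift(1)
df["DayOfWeek"] = df.index.dayofweek  # calendar features are known in advance
df = df.dropna()

feature_cols = ["Lag_1", "SMA_5_prev", "Volume_prev", "DayOfWeek"]
X, y = df[feature_cols], df["Close"]

# Chronological 80/20 split, equivalent to train_test_split(..., shuffle=False).
split = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]
print(X_train.index.max(), "<", X_test.index.min())
```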

## 7. Model Building