# Australian Stock Prices Prediction

### Author: Phuc An Nguyen 

**August 2023**

## 1. Introduction

Stock market analysis and prediction are an important part of stock trading and investing. The benefits of this analysis include making informed investment decisions, setting investment direction, understanding the business, forecasting risk, shaping investment strategy, and monitoring investment performance.

This project uses descriptive and exploratory analysis to identify patterns in the time series data, such as trends, cycles, and seasonal variation, and to highlight its main characteristics. Furthermore, I will use several types of forecasting models to predict future prices from the historical data. I will then evaluate these models, compare their performance, and choose the best one. Finally, I will answer the inspiration questions raised at the beginning of the project.

### 1.1. Data description 

This dataset contains historical share price data for the top 100 companies listed on the Australian Securities Exchange, with one CSV file per company (100 files in total). Each file contains:
- **Date**: date
- **Open**: opening price
- **High**: high price
- **Low**: low price
- **Close**: closing price
- **Adj Close**: adjusted closing price (including dividends)
- **Volume**: trading volume

### 1.2. Inspiration questions

1. Which stocks have high value, high trading volume and low volatility?
2. What periods of the year are stocks most volatile?
3. Can investors profit by predicting stock trends and making informed decisions about buying, selling, or holding stocks?
4. What factors affect the rise or fall of stocks?

### 1.3. Import libraries

In [35]:
# Import libraries
import pandas as pd
import os

import warnings
warnings.filterwarnings('ignore')

## 2. Read multiple CSV files

In [2]:
# Set the path to the folder containing the CSV files
folder_path = 'dataset'

# Create an empty list to store the data from all the CSV files
data_frames = []

# Create an empty list to store the names of the files
file_names = []

# Loop through the files in the folder, check if they are CSV files
for file_name in os.listdir(folder_path):
    if file_name.endswith('.csv'):
        file_path = os.path.join(folder_path, file_name)
        df = pd.read_csv(file_path)
        # Add the name of the file as a new column to the DataFrame
        df['Company'] = file_name
        data_frames.append(df)
        # Store the name of the file in the file_names list
        file_names.append(file_name)

# After the loop, data_frames will contain all the DataFrames from the CSV files in the folder
# file_names will contain the names of all the CSV files in the folder
# To combine all the DataFrames into a single DataFrame
combined_df = pd.concat(data_frames, ignore_index=True)

## 3. Exploratory Data Analysis (EDA)

### 3.1. Data Exploration

In [3]:
combined_df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume | Company |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-03-31 | 0.555 | 0.595 | 0.53 | 0.565 | 0.565 | 4816294.0 | A2M.csv |
| 1 | 2015-04-01 | 0.575 | 0.58 | 0.555 | 0.565 | 0.565 | 4376660.0 | A2M.csv |
| 2 | 2015-04-02 | 0.56 | 0.565 | 0.535 | 0.555 | 0.555 | 2779640.0 | A2M.csv |
| 3 | 2015-04-07 | 0.545 | 0.55 | 0.54 | 0.545 | 0.545 | 392179.0 | A2M.csv |
| 4 | 2015-04-08 | 0.545 | 0.545 | 0.53 | 0.54 | 0.54 | 668446.0 | A2M.csv |


In [4]:
# Check dataframe shape
combined_df.shape

(432888, 8)

In [5]:
# Check null
combined_df.isnull().sum()

Date            0
Open         1527
High         1527
Low          1527
Close        1527
Adj Close    1527
Volume       1527
Company         0
dtype: int64

In [6]:
# Check dataframe information (column types and non-null counts)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432888 entries, 0 to 432887
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date       432888 non-null  object 
 1   Open       431361 non-null  float64
 2   High       431361 non-null  float64
 3   Low        431361 non-null  float64
 4   Close      431361 non-null  float64
 5   Adj Close  431361 non-null  float64
 6   Volume     431361 non-null  float64
 7   Company    432888 non-null  object 
dtypes: float64(6), object(2)
memory usage: 26.4+ MB


In [7]:
# Summarize the numeric columns of the dataframe
combined_df.describe()

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 431361.0 | 431361.0 | 431361.0 | 431361.0 | 431361.0 | 431361.0 |
| mean | 13.628159 | 13.765587 | 13.486251 | 13.62658 | 10.388578 | 3646548.0 |
| std | 19.325604 | 19.506007 | 19.140934 | 19.325726 | 17.343674 | 7011357.0 |
| min | 0.004 | 0.004 | 0.004 | 0.004 | 0.002449 | 0.0 |
| 25% | 3.49 | 3.53073 | 3.45 | 3.49 | 2.229591 | 587909.0 |
| 50% | 7.2 | 7.27548 | 7.11 | 7.19675 | 5.00139 | 1799148.0 |
| 75% | 15.8887 | 16.040001 | 15.698 | 15.8667 | 11.451574 | 4207143.0 |
| max | 339.420013 | 342.75 | 337.029999 | 341.0 | 339.37619 | 993018300.0 |


In [13]:
# Check for negative or zero values in the price columns
(combined_df[['Open','High','Low','Close','Adj Close']].values <= 0).any()

False

In [None]:
# Check outlier
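
The outlier check has not been implemented yet. One possible approach, sketched here as an assumption rather than the method this project will finally use, is the interquartile range (IQR) rule applied per company, since price and volume scales differ widely between stocks. The `toy` DataFrame and the multiplier `k = 1.5` are illustrative stand-ins, not values taken from the dataset.

```python
import pandas as pd

# Toy data standing in for combined_df (illustrative values only)
toy = pd.DataFrame({
    'Company': ['AAA'] * 6 + ['BBB'] * 6,
    'Volume': [100.0, 110.0, 95.0, 105.0, 100.0, 5000.0,
               10.0, 12.0, 11.0, 9.0, 10.0, 11.0],
})

k = 1.5  # conventional IQR multiplier (an assumption, tune as needed)

# Compute the quartiles within each company and broadcast them back to the rows
grouped = toy.groupby('Company')['Volume']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A row is flagged when it falls outside [Q1 - k*IQR, Q3 + k*IQR] for its company
toy['Volume_outlier'] = (toy['Volume'] < q1 - k * iqr) | (toy['Volume'] > q3 + k * iqr)

print(toy[toy['Volume_outlier']])
```

The same pattern could be applied to the price columns. For stock data, flagged rows are not necessarily errors, so they may be worth inspecting before removing them.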

### 3.2. Data Wrangling

Removing Rows with Null Values

In [29]:
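# Remove all rows containing null values; copy so that later column
# assignments modify an independent DataFrame rather than a slice of combined_df
df = combined_df.dropna().copy()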

Change the type of the Date column

In [36]:
# Change the type of Date column from string to datetime
df['Date'] = pd.to_datetime(df['Date'])

Change Company value

In [38]:
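# Remove the .csv extension from the Company name
# (regex=False so that '.' is treated as a literal dot, not "any character")
df['Company'] = df['Company'].str.replace('.csv', '', regex=False)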

Removing outliers

Convert Date into a weekday and check whether any rows fall on a weekend
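
This step is not implemented yet. A minimal sketch of how it might look, assuming the Date column has already been converted to datetime, is shown below. The `toy` DataFrame is a stand-in for `df`, and the date 2015-04-04 (a Saturday) is added deliberately for illustration; it is not a row from the dataset.

```python
import pandas as pd

# Toy data standing in for df; 2015-04-04 is an artificial weekend row
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-03-31', '2015-04-01', '2015-04-02',
                            '2015-04-04', '2015-04-07']),
    'Close': [0.565, 0.565, 0.555, 0.550, 0.545],
})

# Name of the day for each row
toy['Weekday'] = toy['Date'].dt.day_name()

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
toy['IsWeekend'] = toy['Date'].dt.dayofweek >= 5

weekend_rows = toy[toy['IsWeekend']]
print(weekend_rows)
```

If `weekend_rows` is empty on the real data, the dataset contains trading days only, as would be expected for an exchange.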

### 3.3. Data Visualization

Correlation Matrix
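
The correlation matrix has not been computed yet. As a sketch, assuming Pearson correlation between the numeric columns is what is wanted, it could be obtained with `DataFrame.corr()`. The `toy` DataFrame below reuses the first five A2M rows shown in section 3.1 and stands in for `df`.

```python
import pandas as pd

# First five A2M rows from section 3.1, standing in for the full df
toy = pd.DataFrame({
    'Open':   [0.555, 0.575, 0.560, 0.545, 0.545],
    'High':   [0.595, 0.580, 0.565, 0.550, 0.545],
    'Low':    [0.530, 0.555, 0.535, 0.540, 0.530],
    'Close':  [0.565, 0.565, 0.555, 0.545, 0.540],
    'Volume': [4816294.0, 4376660.0, 2779640.0, 392179.0, 668446.0],
})

numeric_cols = ['Open', 'High', 'Low', 'Close', 'Volume']

# Pairwise Pearson correlation (the default method of DataFrame.corr)
corr_matrix = toy[numeric_cols].corr()
print(corr_matrix.round(3))
```

On the full dataset it may be more informative to compute this per company, because pooling stocks with very different price levels can distort the correlations. The resulting matrix can then be passed to a plotting library to draw a heatmap.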

Scatterplot

Histogram

Boxplot

## 4. Methodology

### 4.1. Modelling methods

- **Time Series Models** with ARIMA (AutoRegressive Integrated Moving Average) and SARIMA (Seasonal ARIMA)
- **Regression Model** with Linear Regression model
- **Deep Learning Models** with Long Short-Term Memory (LSTM)
- **Ensemble Models** with Gradient Boosting Machines (GBM)

### 4.2. Model evaluation methods
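
The evaluation metrics for this project have not been chosen yet. One commonly used set for price forecasts, assumed here purely for illustration, is mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). The function names and the sample numbers below are hypothetical. MAPE divides by the actual value, which is safe here only because section 3.1 found no zero or negative prices.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Requires non-zero actual values."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical closing prices and forecasts, for illustration only
actual = [10.0, 12.0, 11.0, 13.0]
predicted = [11.0, 12.0, 10.0, 15.0]

print(f"MAE:  {mae(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
```

Because MAPE is scale-free, it allows comparison across stocks with very different price levels, whereas MAE and RMSE are in the same units as the price.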

### 4.3. Data preparation

## 5. Model Development 

https://neptune.ai/blog/select-model-for-time-series-prediction-task

https://www.diva-portal.org/smash/get/diva2:1719774/FULLTEXT01.pdf

### 5.1. ARIMA (AutoRegressive Integrated Moving Average)

### 5.2. SARIMA (Seasonal ARIMA)

### 5.3. Linear Regression model

### 5.4. Long Short-Term Memory (LSTM)

### 5.5. Gradient Boosting Machines (GBM)

## 6. Comparison and Result

## 7. Conclusion

Answer the questions raised above

### References

The dataset was collected from:
https://www.kaggle.com/datasets/ashbellett/australian-historical-stock-prices?resource=download

https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e

https://www.kaggle.com/code/prakharrathi25/exploratory-data-analysis-step-by-step
