<a href="https://colab.research.google.com/github/OweT1/tsastockmarket/blob/main/Stock_Market.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Files:
- data.parquet/data.csv (.parquet format is good for efficient loading and saving)
    - The main dataset containing stock price, trade volume, news events and news sentiment for S&P 500 companies during the period Oct 2020-Jul 2022
    - 217811 samples in total
    - 26 features per sample in total
    - Prediction Task:
        - Focus on predicting the following 4 features: "Open" (opening price of the day), "Close" (closing price of the day), "High" (highest price of the day), "Low" (lowest price of the day). If you are predicting for day X, you cannot use any of these 4 feature values from day X as model input.
        - If you are predicting for day X, you should also use some attributes from the previous N days (experiment with different values of N). Remember that this is a time series task: today's prediction is also affected by events in the near past. These attributes may include the "Open", "Close", "High" and "Low" features as well.
- sp500wiki.parquet/sp500wiki.csv
    - List of S&P 500 companies as of July 2022 and various metadata in tabular format
    - Contains information for over 500 companies (524 rows in total)
    - 10 attributes per company (10 columns)
    - You can use these attributes as auxiliary features when performing the prediction task for a particular company on a particular day
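
The "previous N days" inputs described in the prediction task can be built by shifting each column within each company. The following is a minimal sketch under stated assumptions, not the notebook's own method: the helper `add_lag_features`, the `_lagK` column naming, and the toy prices are hypothetical, and it assumes `Date` is a regular column (in the real data `Date` is the index, so `df.reset_index()` would be needed first).

```python
import pandas as pd

def add_lag_features(frame, cols=("Open", "High", "Low", "Close"), n_lags=3):
    """Append the previous n_lags rows of each column, computed per Symbol.

    Hypothetical helper: a lag of k means "k trading rows earlier",
    which is not necessarily k calendar days earlier.
    """
    out = frame.sort_values(["Symbol", "Date"]).copy()
    for col in cols:
        for k in range(1, n_lags + 1):
            # shift within each Symbol so one company's prices never
            # leak into another company's lag columns
            out[f"{col}_lag{k}"] = out.groupby("Symbol")[col].shift(k)
    return out

# Toy data with made-up prices, for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-05"] * 2),
    "Symbol": ["AAA"] * 3 + ["BBB"] * 3,
    "Open": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "High": [1.8, 2.8, 3.8, 18.0, 28.0, 38.0],
    "Low": [0.8, 1.8, 2.8, 8.0, 18.0, 28.0],
    "Close": [1.5, 2.5, 3.5, 15.0, 25.0, 35.0],
})
lagged = add_lag_features(toy, n_lags=2)
print(lagged[["Date", "Symbol", "Close", "Close_lag1", "Close_lag2"]])
```

The first rows of each company have missing lag values because no earlier rows exist for them; whether to drop or impute those rows is a separate choice. Since every lag column comes from an earlier row, using them as inputs stays within the rule that day X's own "Open", "Close", "High" and "Low" values are off limits.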

# Additional Information:
- "Symbol", which identifies the company, is the feature common to both datasets
- The dataset contains missing values
- Categorical, discrete and continuous attributes exist in the dataset
- The prediction tasks are regression tasks
- Make sure that train and test data have minimum information leakage (this is something you need to think about deeply)  
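
One common way to limit leakage in a time series task is to split by date, so that every test row comes after every training row. The sketch below illustrates that idea only. The helper `chronological_split`, the cutoff date, and the toy data are hypothetical, and it again assumes `Date` is a regular column.

```python
import pandas as pd

def chronological_split(frame, cutoff, date_col="Date"):
    """Rows strictly before cutoff go to train; rows on or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Toy data with made-up prices, for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"] * 2
    ),
    "Symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "Close": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
train, test = chronological_split(toy, "2022-01-05")  # cutoff chosen arbitrarily
print(train["Date"].max(), test["Date"].min())
```

A date-based split is only one part of avoiding leakage. Any statistics used for scaling or imputation would also need to be computed from the training rows alone.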

In [26]:
import pandas as pd
import numpy as np

In [27]:
from google.colab import drive
drive.mount("/content/drive")

!ls /content/drive/MyDrive/'Colab Datasets'/'Stock Market'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data.csv  data.parquet	ReadMe.md  sp500wiki.csv  sp500wiki.parquet


In [28]:
df = pd.read_parquet("/content/drive/MyDrive/Colab Datasets/Stock Market/data.parquet")
sup = pd.read_parquet("/content/drive/MyDrive/Colab Datasets/Stock Market/sp500wiki.parquet")

In [29]:
print(df.head())
print(sup.head())

                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2020-09-30  160.929993  163.100006  158.610001  160.179993  150.921692   
2020-10-01  160.669998  161.899994  157.720001  158.789993  149.612045   
2020-10-02  156.470001  161.940002  156.250000  160.360001  151.091309   
2020-10-05  162.250000  163.500000  161.759995  162.750000  153.343170   
2020-10-06  163.440002  165.699997  161.830002  162.229996  152.853195   

               Volume Symbol Security  GICS Sector         GICS Sub-Industry  \
Date                                                                           
2020-09-30  3056900.0    MMM       3M  Industrials  Industrial Conglomerates   
2020-10-01  1989100.0    MMM       3M  Industrials  Industrial Conglomerates   
2020-10-02  1768600.0    MMM       3M  Industrials  Industrial Conglomerates   
2020-10-05  1457000.0    MMM       3M  Industrials  Industrial Conglomerates   
2

In [30]:
print(df.dtypes)

Open                             float64
High                             float64
Low                              float64
Close                            float64
Adj Close                        float64
Volume                           float64
Symbol                            object
Security                          object
GICS Sector                       object
GICS Sub-Industry                 object
News - All News Volume           float64
News - Volume                    float64
News - Positive Sentiment        float64
News - Negative Sentiment        float64
News - New Products              float64
News - Layoffs                   float64
News - Analyst Comments          float64
News - Stocks                    float64
News - Dividends                 float64
News - Corporate Earnings        float64
News - Mergers & Acquisitions    float64
News - Store Openings            float64
News - Product Recalls           float64
News - Adverse Events            float64
News - Personnel

In [32]:
print(df.index.min())

2020-09-30 00:00:00


# Check for NA Values
At first, I thought the missing values might fall on each stock's own starting date, but they all turn out to fall on 2020-09-30, which is the starting date of the entire dataset.

In [33]:
print(df['Symbol'].nunique())
print(df.isna().sum())

495
Open                               0
High                               0
Low                                0
Close                              0
Adj Close                          0
Volume                             0
Symbol                             0
Security                           0
GICS Sector                        0
GICS Sub-Industry                  0
News - All News Volume           493
News - Volume                    493
News - Positive Sentiment        493
News - Negative Sentiment        493
News - New Products              493
News - Layoffs                   493
News - Analyst Comments          493
News - Stocks                    493
News - Dividends                 493
News - Corporate Earnings        493
News - Mergers & Acquisitions    493
News - Store Openings            493
News - Product Recalls           493
News - Adverse Events            493
News - Personnel Changes         493
News - Stock Rumors              493
dtype: int64
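
To see which dates the missing values fall on, the rows with at least one missing value can be counted per date. This is a small sketch on made-up data. The helper `na_rows_per_date` is hypothetical, and it assumes the dates are the index, as in `df`.

```python
import numpy as np
import pandas as pd

def na_rows_per_date(frame):
    """Count, for each index value, the rows with at least one missing value."""
    has_na = frame.isna().any(axis=1)
    return has_na.groupby(level=0).sum()

# Toy data with made-up values, for illustration only
toy = pd.DataFrame(
    {
        "Close": [1.0, 2.0, 3.0, 4.0],
        "News - Volume": [np.nan, np.nan, 5.0, 6.0],
    },
    index=pd.to_datetime(
        ["2020-09-30", "2020-09-30", "2020-10-01", "2020-10-01"]
    ),
)
counts = na_rows_per_date(toy)
print(counts[counts > 0])
```

Applied to the real data as `na_rows_per_date(df)`, any date with a nonzero count would appear in the filtered output.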


In [36]:
print(df[df.index == "2020-09-30"])
print(df[df.index == "2020-09-30"].isna().sum())

                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2020-09-30  160.929993  163.100006  158.610001  160.179993  150.921692   
2020-09-30   53.709999   54.020000   52.720001   52.799999   51.231228   
2020-09-30  105.989998  109.480003  105.739998  108.830002  106.019371   
2020-09-30   87.000000   88.440002   86.809998   87.589996   80.923454   
2020-09-30  275.119995  280.880005  272.220001  277.059998  277.059998   
...                ...         ...         ...         ...         ...   
2020-09-30   91.120003   92.529999   90.760002   91.300003   88.541023   
2020-09-30  258.070007  259.390015  250.449997  252.460007  252.460007   
2020-09-30  129.902908  133.912628  129.699036  132.174759  130.564957   
2020-09-30   29.080000   29.570000   28.870001   29.219999   27.909163   
2020-09-30  162.919998  166.789993  162.750000  165.369995  163.708176   

               Volume Symbol         