In [13]:
import os
import pandas as pd
import yahoo_fin.stock_info as si

The function below takes a company's raw price history and returns a new dataframe containing the features.

The features are the sector of the company being evaluated and the overnight (post/pre-market) price change percentage, measured from one day's close to the next day's open.
The 'day_change' column is the open-to-close change of that next day, and it will be used to generate a label in the completed dataset.

In [14]:
# function to read price data and output new dataframe containing premarket change by date
# data to generate/retain:
    # overnight change percentage
    # following day change percentage
def feature_label(historical_prices: pd.DataFrame, filename: str) -> pd.DataFrame:
    # length of dataframe and lists
    length = historical_prices.shape[0] # num of rows

    # list of all opening and closing historical prices
    price_open = historical_prices['open'].tolist()
    price_close = historical_prices['close'].tolist()

    # dataframe that will be returned
    result_data = pd.DataFrame(columns=('sector', 'premarket_change', 'day_change'))

    # ticker symbol is the file name without its extension
    company = os.path.splitext(filename)[0]

    # getting sector of company
    sector = si.get_company_info(company).loc['sector']['Value']

    # iterating through the lists, determining changes by %, and adding to result DF
    # adds sector feature
    for i in range(length - 1):
        # change in price overnight / previous close price
        premarket_change = ( (price_open[i + 1] - price_close[i]) / price_close[i] ) * 100

        # change in price during day / opening price of day
        day_change = ( (price_close[i + 1] - price_open[i + 1]) / price_open[i + 1] ) * 100

        result_data.loc[i] = [sector, premarket_change, day_change]

    return result_data
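
The loop above appends one row at a time, which gets slow on long price histories. As a rough sketch, the same two percentages can be computed column-wise with `shift`. The name `feature_label_vectorized` is hypothetical, and the sector is passed in as an argument here (rather than looked up through yahoo_fin) so the example runs offline on made-up prices.

```python
import pandas as pd

def feature_label_vectorized(historical_prices: pd.DataFrame, sector: str) -> pd.DataFrame:
    # previous day's close, aligned with the current day's row
    prev_close = historical_prices['close'].shift(1)
    price_open = historical_prices['open']
    price_close = historical_prices['close']

    result = pd.DataFrame({
        'sector': sector,
        # overnight change relative to the previous close
        'premarket_change': (price_open - prev_close) / prev_close * 100,
        # intraday change relative to the same day's open
        'day_change': (price_close - price_open) / price_open * 100,
    })

    # the first row has no previous close, so drop it and renumber from 0
    return result.iloc[1:].reset_index(drop=True)

# made-up prices, purely for illustration
prices = pd.DataFrame({'open': [100.0, 102.0, 99.0],
                       'close': [101.0, 100.0, 103.0]})
features = feature_label_vectorized(prices, sector='Technology')
print(features)
```

Row `i` of the result pairs the close of day `i` with the open and close of day `i + 1`, which matches the indexing used in the loop.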

The next cell calls the 'feature_label' function on every CSV file in the raw_stock_data directory and stores the processed data in the clean_stock_data directory.

In [None]:
n = 1
for filename in os.listdir('raw_stock_data'):
    
    # easy tracking of progress because I'm lazy
    print(f"getting data for {filename}... {n}/500")
    n += 1
    
    data = pd.read_csv(f'raw_stock_data/{filename}')
    result = feature_label(data, filename)
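    # index=False keeps the row index out of the file, so reading it back does not add an 'Unnamed: 0' column
    result.to_csv(path_or_buf=f"clean_stock_data/{filename}", index=False)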
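
The progress message above hardcodes a total of 500 and assumes every directory entry is a CSV file. A small sketch of one way to avoid both assumptions is shown here. The helper name `csv_tickers` and the sample file names are hypothetical, and the list stands in for the output of `os.listdir`.

```python
import os

def csv_tickers(filenames):
    """Yield (position, total, ticker, filename) for each .csv name."""
    # keep only CSV files and process them in a stable order
    csv_files = sorted(f for f in filenames if f.endswith('.csv'))
    total = len(csv_files)
    for position, filename in enumerate(csv_files, start=1):
        # ticker is the file name without its extension
        ticker = os.path.splitext(filename)[0]
        yield position, total, ticker, filename

# stand-in for os.listdir('raw_stock_data')
for position, total, ticker, filename in csv_tickers(['AAPL.csv', 'MSFT.csv', 'notes.txt']):
    print(f"getting data for {filename}... {position}/{total}")
```

In the real loop, the list passed in would come from `os.listdir('raw_stock_data')`, and the body would read and process each file as before.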

Function to identify the appropriate label for each row. This was added afterwards, when I realized that this classification model needs class labels. Rather than refactoring the code and data completely, I decided to build this to append the labels more quickly.

In [17]:
# function to determine label to apply to data pair
def get_class(row) -> str:
    if row['day_change'] < 0:
        return 'decrease'
    elif row['day_change'] > 0:
        return 'increase'
    else:
        return 'minimal'
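
For reference, the same three-way rule can be written column-wise with `numpy.select` instead of a row-by-row function. This is only a sketch on a made-up three-row frame, not the approach used in this notebook; the `sample` dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# made-up day_change values covering all three cases
sample = pd.DataFrame({'day_change': [-1.5, 0.0, 2.25]})

# conditions are checked in order; rows matching neither get the default
conditions = [sample['day_change'] < 0, sample['day_change'] > 0]
choices = ['decrease', 'increase']
sample['label'] = np.select(conditions, choices, default='minimal')

print(sample)
```

Note that 'minimal' is assigned only when the change is exactly zero (or missing), the same as in 'get_class'.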

First, all the data is concatenated into a single dataframe. This is done in order to fit the model to the complete dataset.

We then call the 'get_class' function inside a lambda function to assign the appropriate label to each row in the dataset.

Lastly, the completed dataset is stored as a csv for easy read/write access.

In [None]:
# Collect all cleaned data and concat into a single DataFrame for fitting the model

# list to track references to all the dataframes of stock data
df_tracker = []

# iterating through directory to pull in all csv files as dataframes and storing references in df_tracker
n = 1
for filename in os.listdir('clean_stock_data'):
    
    # easy tracking of progress because I'm lazy part 2 hehe
    print(f"parsing data from {filename} to features and labels... {n}/500")
    n += 1

    try:
        data = pd.read_csv(f'clean_stock_data/{filename}')
        df_tracker.append(data)
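    except Exception as err:
        # report the reason so a bad file can be diagnosed, then continue with the rest
        print(f"Failed to copy data from {filename} to full_df: {err}")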

# concatenating all the data into a single set
full_df = pd.concat(df_tracker, ignore_index=True)

# adding label to each row
full_df['label'] = full_df.apply(lambda row: get_class(row), axis=1)

# exporting complete dataset to csv
# index=False keeps the row index from being written as an extra unnamed column
full_df.to_csv(path_or_buf='complete_datasets/complete_stock_ds.csv', index=False)