Feature 1: Combination of stock market implied volatility ("^VIX") & oil volatility ("CL=F") + S&P 500 ("^GSPC")

Zhifeng Dai, Huiting Zhou, Xiaodi Dong, Jie Kang, "Forecasting Stock Market Volatility: A Combination Approach", Discrete Dynamics in Nature and Society, vol. 2020, Article ID 1428628, 9 pages, 2020. https://doi.org/10.1155/2020/1428628

Feature 2: Market Reaction at the time of the announcement (from news or sth else)

https://www.jstor.org/stable/2330704 || http://stock.finance.sina.com.cn/stock/go.php/vReport_Show/kind/lastest/rptid/667311117237/index.phtml

Feature 3: SIC Code (Top 2: Industry Classification)

https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list

Feature 4: Financial Ratio (e.g., top-18 or half of the total)

Ravisankar, P., Ravi, V., Rao, G. R., & Bose, I. (2011). Detection of financial statement fraud and feature selection using data mining techniques. Decision support systems, 50(2), 491-500.

Feature 5: Statistical characteristic (e.g., Volume\/Volatility\/Trend\/Momentum\/Other)

https://github.com/bukosabino/ta || https://technical-analysis-library-in-python.readthedocs.io/en/latest/


In [3]:
import os 
import re
import time
#import swifter
import pandas as pd
import numpy as np
import datetime as dt
import scipy.stats as st
import yfinance as yf
from sklearn import linear_model, preprocessing
from dateutil.relativedelta import relativedelta
#from yahoofinancials import YahooFinancials
from ta import add_all_ta_features
from ta.utils import dropna

# Parameters Setting
alpha = 0.3
beta = 0.3
gamma = 1- alpha - beta
a = 0.1
b = 0.9

# Data Preprocessing

In [4]:
# Explain how you handle missing values.
# Explain how you identify outliers and handle outliers.
# Explain how you transform your dataset (scaling, taking log) and the effects of your efforts in charts or in words.
# Encoding of categorical variables.

df = pd.read_csv("../stock_text_y_after_discard/final_v1_df.csv")
df.columns = ['id'] + df.columns[1:].tolist()
df.set_index(['id'], inplace=True)
# df = df.dropna()
# df.head()

In [6]:
df.head()

Unnamed: 0_level_0,time,risk_factor,item1,10K_report_date,ticker,item7,growth,y,sic,sic_letter
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,2005,Any decline in demand for public water transmi...,Item 1. \n\n Business \n\n General \n\n North...,20060314,nwpx,Item 7. \n\n Management s Discussion and Anal...,1.558627,3,3317,D
1,2006,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20070402,nwpx,Table of Contents \n\n Item 7. Management s...,1.175339,2,3317,D
2,2007,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20080317,nwpx,Table of Contents \n\n Item 7. Management s...,0.709823,0,3317,D
3,2008,The success of our business is affected by gen...,Item 1. Business We are a leading North Ame...,20090313,nwpx,Table of Contents \n\n Item 7. Management s...,0.566531,0,3317,D
4,2010,Our Audit Committee and management have identi...,Item 1. Business Overview We are a leadin...,20110322,nwpx,\n Item 7. Management s Discussion and Analy...,1.331586,2,3317,D


In [35]:
df['item7'].iloc[1]

' Table of Contents \n\n Item 7.   Management s Discussion and Analysis of Financial Condition and Results of Operations Forward-Looking Statements This Management s Discussion and Analysis of Financial Condition and Results of Operations and other sections of this Report contain forward-looking statements within the meaning of the Securities Litigation Reform Act of 1995 that are based on current expectations, estimates and projections about our business, management s beliefs, and assumptions made by management. Words such as expects, anticipates, intends, plans, believes, seeks, estimates, should,  and variations of such words and similar expressions are intended to identify such forward-looking statements. These statements are not guarantees of future performance and involve risks and uncertainties that are difficult to predict. Therefore, actual outcomes and results may differ materially from what is expressed or forecasted in such forward-looking statements due to numerous factors

In [5]:
# Feature 1 Creation: Combination of stock market implied volatility ("^VIX") & oil volatility ("CL=F") & S&P 500 ("^GSPC")
# Calculate => volatility = (alpha * ^VIX + beta * CL=F) / gamma * S&P 500
list = []
for i in range(2005,2021,1):
    stime = str(i)+'-01-01'
    etime = str(i+1)+'-01-01'
    data = yf.download("^GSPC ^VIX CL=F", start=stime, end=etime)
    tmp1 = data['Adj Close']['CL=F'].dropna()
    tmp2 = data['Adj Close']['^VIX'].dropna()
    tmp3 = data['Adj Close']['^GSPC'].dropna()
    tmp1 = tmp1.sum()/len(tmp1)
    tmp2 = tmp2.sum()/len(tmp2)
    tmp3 = tmp3.sum()/len(tmp3)
    list.append((i, tmp1, tmp2, tmp3))

[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%***********************]  3 of 3 completed
[*********************100%********

In [122]:
l = pd.DataFrame(list)
l.columns=['time','CL=F','^VIX','^GSPC']
def min_max_process(list):
    tmp = []
    min = list.min()
    max = list.max()
    for i,v in list.items():
        std = (v - min)/(max - min)
        if(std==0): std = 1e-6
        tmp.append(std)
    return tmp
l.iloc[:,1] = min_max_process(l['CL=F'])
l.iloc[:,2] = min_max_process(l['^VIX'])
l.iloc[:,3] = min_max_process(l['^GSPC'])
volatility = []
for i, r in l.iterrows():
    v = (alpha * r['^VIX'] + beta * r['CL=F']) / gamma * r['^GSPC']
    volatility.append(v)
volatility = pd.DataFrame(volatility)
l = pd.concat([l,volatility],axis=1)
l.columns=['time','CL=F','^VIX','^GSPC','volatility']
# l.set_index(['time'],inplace=True)
# l.to_csv('volatility_index.csv')

In [131]:
# Add Feature 1 into table
df['volatility'] = 0
for i,r in df.iterrows():
    t = r['time']
    df.loc[i,'volatility'] = l.loc[t-2005,:][4]
# df

Unnamed: 0_level_0,time,risk_factor,item1,10K_report_date,ticker,item7,growth,y,sic,sic_letter,volatility
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,2005,Any decline in demand for public water transmi...,Item 1. \n\n Business \n\n General \n\n North...,20060314,nwpx,Item 7. \n\n Management s Discussion and Anal...,1.558627,3,3317,D,0.031378
1,2006,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20070402,nwpx,Table of Contents \n\n Item 7. Management s...,1.175339,2,3317,D,0.062822
2,2007,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20080317,nwpx,Table of Contents \n\n Item 7. Management s...,0.709823,0,3317,D,0.147784
3,2008,The success of our business is affected by gen...,Item 1. Business We are a leading North Ame...,20090313,nwpx,Table of Contents \n\n Item 7. Management s...,0.566531,0,3317,D,0.180493
4,2010,Our Audit Committee and management have identi...,Item 1. Business Overview We are a leadin...,20110322,nwpx,\n Item 7. Management s Discussion and Analy...,1.331586,2,3317,D,0.075963
...,...,...,...,...,...,...,...,...,...,...,...
6473,2013,Although we have chosen to pursue conversion t...,"ITEM 1. BUSINESS The words Equinix , we , ou...",20140228,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.911720,4,6798,H,0.256716
6474,2014,We may not qualify or remain qualified as a RE...,"ITEM 1. BUSINESS The words Equinix , we , ou...",20150302,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,2.184554,4,6798,H,0.334596
6475,2016,Consummation of the Verizon Asset Purchase is ...,"ITEM 1. BUSINESS The words Equinix , we ,...",20170227,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.366027,3,6798,H,0.108364
6476,2017,"Acquisitions present many risks, and we may no...",Table of Contents ITEM 1. BUSINESS Th...,20180226,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.487546,3,6798,H,0.093951


In [None]:
# Feature 2: Market Reaction at the time of the announcement (based on the event study method)
# observe window is 252 days, and event window is [-20, 20]
def do_event_study(data_ret, eventdate, ticker, estimation_period=252, before_event=20, event_window_start=-20,
                   event_window_end=20, benchmark="^GSPC"):

    # Generate post-event indicator
    data_ret["post_event"] = (data_ret["Date"] >= eventdate).astype(int)  # 1 if after event, 0 otherwise
    data_ret = (
        data_ret.reset_index()
    )  # pushes out the current index column and create a new one

    # Identify the index for the event date
    event_date_index = data_ret.groupby(["post_event"])["index"].transform("min").max()
    data_ret["event_date_index"] = event_date_index

    # Create the variable day relative to event
    data_ret["rel_day"] = data_ret["index"] - data_ret["event_date_index"]

    # Identify estimation period
    estimation = data_ret[
        (data_ret["rel_day"] < -before_event)
        & (data_ret["rel_day"] >= -estimation_period - before_event)
        ]

    # Identify event period
    event = data_ret[
        (data_ret["rel_day"] <= event_window_end)
        & (data_ret["rel_day"] >= event_window_start)
        ]

    # Calculate expected returns with the market model
    x_df = estimation[benchmark].values.reshape(-1, 1)

    # Create an empty list to store betas
    betas = []

    # Calculate betas for the market model
    for y in [benchmark, ticker]:
        y_df = estimation[y].values.reshape(-1, 1)
        reg = linear_model.LinearRegression()
        betas.append(reg.fit(x_df, y_df).coef_)

    # Convert the list to a Numpy Array
    beta_np = np.array(betas)
    # beta_np

    # Expected Returns via Beta
    # Need Numpy Array to do Calculations!
    sp500array = event[benchmark].values
    expected_returns = np.outer(sp500array, beta_np)
    expected_returns = pd.DataFrame(expected_returns, index=event.index)
    expected_returns.columns = [benchmark, ticker]
    expected_returns = expected_returns.rename(columns={ticker: "expected_return"})
    del expected_returns[benchmark]

    # Abnormal Returns
    event = pd.concat([event, expected_returns], axis=1, ignore_index=False)

    event["abnormal_return"] = event[ticker] - event["expected_return"]

    # Event CAR
    winar1 = event[(event["rel_day"] <= 1) & (event["rel_day"] >= -1)][
        "abnormal_return"
    ].sum()  # CAR[-1,+1]
    winar2 = event[(event["rel_day"] <= 1) & (event["rel_day"] >= 0)][
        "abnormal_return"
    ].sum()  # CAR[0,+1]

    # Day-by-day AR
    winar3 = event[(event["rel_day"] <= -1) & (event["rel_day"] >= -1)][
        "abnormal_return"
    ].sum()  # Event Day -1
    winar4 = event[(event["rel_day"] <= 0) & (event["rel_day"] >= 0)][
        "abnormal_return"
    ].sum()  # Event Day 0
    winar5 = event[(event["rel_day"] <= 1) & (event["rel_day"] >= 1)][
        "abnormal_return"
    ].sum()  # Event Day 1

    # Post Event CAR
    winar6 = event[(event["rel_day"] <= 5) & (event["rel_day"] >= 2)][
        "abnormal_return"
    ].sum()  # CAR[2,5]
    winar7 = event[(event["rel_day"] <= 10) & (event["rel_day"] >= 2)][
        "abnormal_return"
    ].sum()  # CAR[2,10]
    winar8 = event[(event["rel_day"] <= 20) & (event["rel_day"] >= 2)][
        "abnormal_return"
    ].sum()  # CAR[2,20]

    # Pre Event CAR
    winar9 = event[(event["rel_day"] <= -2) & (event["rel_day"] >= -5)][
        "abnormal_return"
    ].sum()  # CAR[-5,-2]
    winar10 = event[(event["rel_day"] <= -2) & (event["rel_day"] >= -10)][
        "abnormal_return"
    ].sum()  # CAR[-10,-2]
    winar11 = event[(event["rel_day"] <= -2) & (event["rel_day"] >= -20)][
        "abnormal_return"
    ].sum()  # CAR[-20,-2]

    return (
        winar1,
        winar2,
        winar3,
        winar4,
        winar5,
        winar6,
        winar7,
        winar8,
        winar9,
        winar10,
        winar11,
    )

def winar_calculation(list):
    pos = 0
    neg = 0
    for i,r in list.iterrows():
        if(r[0]>0): pos +=1
        if(r[0]<0): neg +=1
    return pos, neg


# Event Studies for-loop
# Add Feature 2 into table

# Init
# df['reaction_positive'] = 0
# df['reaction_negative'] = 0
for i,r in df.iterrows():
    t = str(r['10K_report_date'])
    symbol = r['ticker'].upper()
    symbols_list = ['^GSPC', symbol]
    t = dt.datetime.strptime(t, '%Y%m%d')
    start = (t- relativedelta(years=1)).strftime("%Y-%m-%d")
    end = (t+ relativedelta(years=1)).strftime("%Y-%m-%d")
    data = yf.download(symbols_list, start=start, end=end)

    # Calculate returns
    main_data = data["Adj Close"] / data["Adj Close"].shift(1) - 1
    main_data = main_data.dropna()
    main_data = main_data.reset_index()
    cars = []
    cars.append(do_event_study(main_data, ticker=symbol, eventdate=t))
    cars = pd.DataFrame(cars)
    cars = cars.T
    p,n = winar_calculation(cars)
    df.loc[i,'reaction_positive'] = p
    df.loc[i,'reaction_negative'] = n

In [314]:
# Feature 5: Statistical Characteristic
# Based on the Technical Analysis Library, whose name is "ta"
# Focus on 5 parts: Volume\Volatility\Trend\Momentum\Others
# 'volume_adi','volume_mfi','volatility_atr','trend_sma_slow','trend_macd','momentum_rsi','momentum_roc','momentum_ppo','others_dlr','others_cr'

tickers = pd.read_table('./Data/final_temp_ticker.txt',header=None)
tickers = tickers[0].tolist()
for symbol in tickers:
    data_tmp = yf.download(symbol, start=dt.datetime(2004, 1, 1), end=dt.datetime(2020, 12, 31))
    data_tmp = dropna(data_tmp)
    data_tmp = add_all_ta_features(data_tmp, open="Open", high="High", low="Low", close="Adj Close", volume="Volume")
    data_tmp = pd.concat([data_tmp['volume_adi'],data_tmp['volume_mfi'],data_tmp['volatility_atr'],data_tmp['trend_sma_slow'],data_tmp['trend_macd'],data_tmp['momentum_rsi'],data_tmp['momentum_roc'],data_tmp['momentum_ppo'],data_tmp['others_dlr'],data_tmp['others_cr']],axis=1)
    filename = str(symbol).lower() + '.csv'
    data_tmp.to_csv('./Data/stock_indicators/'+filename)

data_tickers = pd.DataFrame(columns=['ticker','time','volume_adi','volume_mfi','volatility_atr','trend_sma_slow','trend_macd','momentum_rsi','momentum_roc','momentum_ppo','others_dlr','others_cr'])
for symbol in tickers:
    result = []
    filename = symbol + '.csv'
    s = 0
    s = pd.read_csv('./Data/stock_indicators/'+filename)
    s.set_index('Date',inplace=True)
    for i in range(2006,2021,1):
        start = str(i) + '-01-01'
        end = str(i+1) + '-01-01'
        x = s[start:end]
        tmp = x.mean()
        tmp = tmp.tolist()
        r = []
        r.append(symbol)
        r.append(str(i))
        for t in tmp:
            r.append(t)
        result.append(r)
    result = pd.DataFrame(result)
    result.columns=['ticker','time','volume_adi','volume_mfi','volatility_atr','trend_sma_slow','trend_macd','momentum_rsi','momentum_roc','momentum_ppo','others_dlr','others_cr']
    data_tickers = pd.concat([data_tickers,result],axis=0)
data_tickers = data_tickers.dropna()
# data_tickers.to_csv('final_indicators.csv')

In [None]:
def search_indicators(ticker,time):
    for i,r in data_tickers.iterrows():
        if(r['time']==str(time) and r['ticker']==ticker):
            k=[]
            k.append(r['volume_adi'])
            k.append(r['volume_mfi'])
            k.append(r['volatility_atr'])
            k.append(r['trend_sma_slow'])
            k.append(r['trend_macd'])
            k.append(r['momentum_rsi'])
            k.append(r['momentum_roc'])
            k.append(r['momentum_ppo'])
            k.append(r['others_dlr'])
            k.append(r['others_cr'])
            return k
    return [None,None,None,None,None,None,None,None,None,None]
        

for i,r in df.iterrows():
    ticker = r['ticker']
    time = r['time']
    list = search_indicators(ticker,time)
    df.loc[i,'volume_adi'] = list[0]
    df.loc[i,'volume_mfi'] = list[1]
    df.loc[i,'volatility_atr'] = list[2]
    df.loc[i,'trend_sma_slow'] = list[3]
    df.loc[i,'trend_macd'] = list[4]
    df.loc[i,'momentum_rsi'] = list[5]
    df.loc[i,'momentum_roc'] = list[6]
    df.loc[i,'momentum_ppo'] = list[7]
    df.loc[i,'others_dlr'] = list[8]
    df.loc[i,'others_cr'] = list[9]

df = df.replace([np.inf, -np.inf], np.nan).dropna()
# df

In [363]:
# df.to_csv('final_v3_df.csv')
df

Unnamed: 0_level_0,time,risk_factor,item1,10K_report_date,ticker,item7,growth,y,sic,sic_letter,...,volume_adi,volume_mfi,volatility_atr,trend_sma_slow,trend_macd,momentum_rsi,momentum_roc,momentum_ppo,others_dlr,others_cr
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2006,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20070402,nwpx,Table of Contents \n\n Item 7. Management s...,1.175339,2,3317,D,...,-3.515938e+05,50.359009,0.792847,28.379537,0.178588,53.106931,1.249287,0.585889,0.090921,115.647695
2,2007,A downturn in government spending related to p...,Item 1. Business We are a leading North Ame...,20080317,nwpx,Table of Contents \n\n Item 7. Management s...,0.709823,0,3317,D,...,-1.440436e+06,57.280553,1.412616,35.654050,0.098563,51.766881,0.861268,0.262177,0.060567,169.417288
3,2008,The success of our business is affected by gen...,Item 1. Business We are a leading North Ame...,20090313,nwpx,Table of Contents \n\n Item 7. Management s...,0.566531,0,3317,D,...,-1.414003e+06,53.844115,2.589620,43.180485,-0.106460,52.458817,1.131071,-0.514644,0.033575,223.662843
4,2010,Our Audit Committee and management have identi...,Item 1. Business Overview We are a leadin...,20110322,nwpx,\n Item 7. Management s Discussion and Analy...,1.331586,2,3317,D,...,-4.214262e+06,50.947005,0.922209,21.281775,-0.128362,48.554472,-0.209447,-0.584321,-0.044181,58.650848
5,2011,Our Audit Committee and management have identi...,Item 1. Business Overview We are a leadin...,20120427,nwpx,\n Item 7. Management s Discussion and Analy...,1.252679,2,3317,D,...,-4.860406e+06,49.961210,1.204679,24.127654,0.026656,50.792482,0.489925,0.078676,-0.019807,81.374244
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6473,2013,Although we have chosen to pursue conversion t...,"ITEM 1. BUSINESS The words Equinix , we , ou...",20140228,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.911720,4,6798,H,...,-2.201136e+10,47.417896,40.392765,155.499941,-0.563620,48.492309,-0.597432,-0.418673,-0.059587,582.131185
6474,2014,We may not qualify or remain qualified as a RE...,"ITEM 1. BUSINESS The words Equinix , we , ou...",20150302,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,2.184554,4,6798,H,...,-2.638618e+10,59.570540,40.465728,160.913902,1.485700,57.344488,1.629953,0.909879,0.112526,622.867336
6475,2016,Consummation of the Verizon Asset Purchase is ...,"ITEM 1. BUSINESS The words Equinix , we ,...",20170227,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.366027,3,6798,H,...,-3.219974e+10,49.605018,38.301655,306.367418,1.477140,54.232824,1.026203,0.496825,0.074632,1265.705500
6476,2017,"Acquisitions present many risks, and we may no...",Table of Contents ITEM 1. BUSINESS Th...,20180226,eqix,\n ITEM 7. MANAGEMENT S DISCUSSION AND ANALY...,1.487546,3,6798,H,...,-3.375959e+10,56.892858,38.925276,388.472880,3.048024,58.235279,1.317394,0.816853,0.102055,1638.409575


# Features Engineering

In [28]:
import nltk
import pysentiment as ps
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import RegexpTokenizer, sent_tokenize

#df = pd.read_csv('./final_v3_df.csv', index_col=['id'])
# df

In [16]:
# Data Cleaning || instead of re.sub(r"\s+", " ", x.strip()))
def clean_data(bdata):
    bdata = str(bdata)
    var = re.sub(r"[&|\,|\.|\?|\!|\(|\)|\n|\/]","",bdata).strip()
    var = re.sub(r"[\']","",var.lower())
    var = re.sub(r"[\“\”\$\`\`\''\``]","",var)
    var = var.replace("''","")
    var = var.replace("``","")
    var = re.sub(r"[\-]","",var)
    var = re.sub(r"[\d+]","",var)
    var = re.sub(r"[\*]{,8}","",var)
    var = re.sub(r"[\—]","",var)
    var = re.sub(r"[\:]","",var)
    var = re.sub(r"[\;\%\-]","",var)
    var = re.sub("[^a-zA-z0-9\s']"," ", var)
    var = re.sub("[']","", var)
    var = re.sub("[\\\\]","", var)
    var = re.sub(r"mw","", var)
    var = re.sub(r"``","", var)
    var = re.sub(r"\n","", var)
    var = re.sub(r"[nn]{2,20}","",var)
    var = re.sub(r"['-']{,20}","",var)
    var = re.sub(r"['q']{2,20}","",var)
    var = re.sub(r"['[[']{1,20}","",var)
    var = re.sub(r"[\]\]]","",var).strip()

    return var

df['risk_factor'] = df['risk_factor'].apply(lambda x: clean_data(x))
df['item1'] = df['item1'].apply(lambda x: clean_data(x))
df['item7'] = df['item7'].apply(lambda x: clean_data(x))
df

Unnamed: 0_level_0,time,risk_factor,item1,10K_report_date,ticker,item7,growth,y,sic,sic_letter,...,volume_adi,volume_mfi,volatility_atr,trend_sma_slow,trend_macd,momentum_rsi,momentum_roc,momentum_ppo,others_dlr,others_cr
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2006,a downturn in government spending related to p...,item business we are a leading north americ...,20070402,nwpx,table of contents item management s discus...,1.175339,2,3317,D,...,-3.515938e+05,50.359009,0.792847,28.379537,0.178588,53.106931,1.249287,0.585889,0.090921,115.647695
2,2007,a downturn in government spending related to p...,item business we are a leading north americ...,20080317,nwpx,table of contents item management s discus...,0.709823,0,3317,D,...,-1.440436e+06,57.280553,1.412616,35.654050,0.098563,51.766881,0.861268,0.262177,0.060567,169.417288
3,2008,the success of our business is affected by gen...,item business we are a leading north americ...,20090313,nwpx,table of contents item management s discus...,0.566531,0,3317,D,...,-1.414003e+06,53.844115,2.589620,43.180485,-0.106460,52.458817,1.131071,-0.514644,0.033575,223.662843
4,2010,our audit committee and management have identi...,item business overview we are a leading n...,20110322,nwpx,item management s discussion and analysis o...,1.331586,2,3317,D,...,-4.214262e+06,50.947005,0.922209,21.281775,-0.128362,48.554472,-0.209447,-0.584321,-0.044181,58.650848
5,2011,our audit committee and management have identi...,item business overview we are a leading n...,20120427,nwpx,item management s discussion and analysis o...,1.252679,2,3317,D,...,-4.860406e+06,49.961210,1.204679,24.127654,0.026656,50.792482,0.489925,0.078676,-0.019807,81.374244
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6473,2013,although we have chosen to pursue conversion t...,item business the words equinix we our o...,20140228,eqix,item management s discussion and analysis o...,1.911720,4,6798,H,...,-2.201136e+10,47.417896,40.392765,155.499941,-0.563620,48.492309,-0.597432,-0.418673,-0.059587,582.131185
6474,2014,we may not qualify or remain qualified as a re...,item business the words equinix we our o...,20150302,eqix,item management s discussion and analysis o...,2.184554,4,6798,H,...,-2.638618e+10,59.570540,40.465728,160.913902,1.485700,57.344488,1.629953,0.909879,0.112526,622.867336
6475,2016,consummation of the verizon asset purchase is ...,item business the words equinix we our ...,20170227,eqix,item management s discussion and analysis o...,1.366027,3,6798,H,...,-3.219974e+10,49.605018,38.301655,306.367418,1.477140,54.232824,1.026203,0.496825,0.074632,1265.705500
6476,2017,acquisitions present many risks and we may not...,table of contents item business the w...,20180226,eqix,item management s discussion and analysis o...,1.487546,3,6798,H,...,-3.375959e+10,56.892858,38.925276,388.472880,3.048024,58.235279,1.317394,0.816853,0.102055,1638.409575


In [31]:
hiv4 = ps.HIV4()
lm = ps.LM()
# lyl: get processed text
def get_processed_text(text):
    word_hiv4 = hiv4.tokenize(text)
    words_lm = lm.tokenize(text)
    return words_lm

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 184: invalid continuation byte

In [23]:
df4 = pd.read_csv("final_v4_df.csv")

In [24]:
temp_risk_factor = df4['risk_factor'].iloc[0]

In [25]:
processed_rf = get_processed_text(temp_risk_factor)
print(processed_rf)

NameError: name 'hiv4' is not defined

In [30]:
# Use Harvard IV-4 / Loughran and McDonald Dictionaries https://sraf.nd.edu/loughranmcdonald-master-dictionary/
hiv4 = ps.HIV4()
lm = ps.LM()

def get_sentiment_scores(text):
    score = []
    words_hiv4 = hiv4.tokenize(text)
    words_lm = lm.tokenize(text)
    score_hiv4 = hiv4.get_score(words_hiv4)
    score_lm = hiv4.get_score(words_lm)
    score.append(a * score_hiv4['Positive'] + b * score_lm['Positive'])
    score.append(a * score_hiv4['Negative'] + b * score_lm['Negative'])
    score.append(a * score_hiv4['Polarity'] + b * score_lm['Polarity'])
    score.append(a * score_hiv4['Subjectivity'] + b * score_lm['Subjectivity'])
    return score, words_lm

# score, words_lm => put into the table
words_processed = []
for i,r in df.iterrows():
    scores_rf, words_rf = get_sentiment_scores(r['risk_factor'])
    scores_i1, words_i1 = get_sentiment_scores(r['item1'])
    scores_i7, words_i7 = get_sentiment_scores(r['item7'])
    words_processed.append([words_rf,words_i1,words_i7])
    df.loc[i,'risk_factor_pos'] = scores_rf[0]
    df.loc[i,'risk_factor_neg'] = scores_rf[1]
    df.loc[i,'risk_factor_polarity'] = scores_rf[2]
    df.loc[i,'risk_factor_subjectivity'] = scores_rf[3]
    df.loc[i,'risk_factor_len'] = len(words_rf)
    df.loc[i,'item1_pos'] = scores_i1[0]
    df.loc[i,'item1_neg'] = scores_i1[1]
    df.loc[i,'item1_polarity'] = scores_i1[2]
    df.loc[i,'item1_subjectivity'] = scores_i1[3]
    df.loc[i,'item1_len'] = len(words_i1)
    df.loc[i,'item7_pos'] = scores_i7[0]
    df.loc[i,'item7_neg'] = scores_i7[1]
    df.loc[i,'item7_polarity'] = scores_i7[2]
    df.loc[i,'item7_subjectivity'] = scores_i7[3]
    df.loc[i,'item1_len'] = len(words_i7)
    
words_processed = pd.DataFrame(words_processed)
words_processed.columns=['risk_factor_processed','item1_processed','item7_processed']
df


  self.init_dict()


Unnamed: 0_level_0,time,risk_factor,item1,10K_report_date,ticker,item7,growth,y,sic,sic_letter,...,risk_factor_len,item1_pos,item1_neg,item1_polarity,item1_subjectivity,item1_len,item7_pos,item7_neg,item7_polarity,item7_subjectivity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2006,a downturn in government spending related to p...,item business we are a leading north americ...,20070402,nwpx,table of contents item management s discus...,1.175339,2,3317,D,...,1789.0,438.0,174.0,0.431373,0.277174,2897.0,690.0,218.0,0.519824,0.313428
2,2007,a downturn in government spending related to p...,item business we are a leading north americ...,20080317,nwpx,table of contents item management s discus...,0.709823,0,3317,D,...,1731.0,441.0,178.0,0.424879,0.276957,2630.0,638.0,220.0,0.487179,0.326236
3,2008,the success of our business is affected by gen...,item business we are a leading north americ...,20090313,nwpx,table of contents item management s discus...,0.566531,0,3317,D,...,1760.0,354.0,139.0,0.436105,0.268519,2618.0,599.0,205.0,0.490050,0.307105
4,2010,our audit committee and management have identi...,item business overview we are a leading n...,20110322,nwpx,item management s discussion and analysis o...,1.331586,2,3317,D,...,3280.0,362.0,136.0,0.453815,0.260733,3246.0,746.0,298.0,0.429119,0.321627
5,2011,our audit committee and management have identi...,item business overview we are a leading n...,20120427,nwpx,item management s discussion and analysis o...,1.252679,2,3317,D,...,3232.0,388.0,142.0,0.464151,0.269857,3271.0,759.0,253.0,0.500000,0.309386
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6473,2013,although we have chosen to pursue conversion t...,item business the words equinix we our o...,20140228,eqix,item management s discussion and analysis o...,1.911720,4,6798,H,...,5767.0,806.0,274.0,0.492593,0.275089,10680.0,2269.0,1014.0,0.382272,0.307397
6474,2014,we may not qualify or remain qualified as a re...,item business the words equinix we our o...,20150302,eqix,item management s discussion and analysis o...,2.184554,4,6798,H,...,6561.0,922.0,280.0,0.534110,0.272501,8273.0,1710.0,826.0,0.348580,0.306539
6475,2016,consummation of the verizon asset purchase is ...,item business the words equinix we our ...,20170227,eqix,item management s discussion and analysis o...,1.366027,3,6798,H,...,7649.0,999.0,241.0,0.611290,0.267472,9157.0,1845.0,907.0,0.340843,0.300535
6476,2017,acquisitions present many risks and we may not...,table of contents item business the w...,20180226,eqix,item management s discussion and analysis o...,1.487546,3,6798,H,...,7793.0,966.0,246.0,0.594059,0.263995,9050.0,1781.0,874.0,0.341620,0.293370


In [42]:
dt = df.copy()
dt = dt.reset_index(drop=True)
dt = pd.concat([dt,words_processed],axis=1)
# dt.to_csv('./final_v4_df.csv')

In [43]:
dt.columns

Index(['time', 'risk_factor', 'item1', '10K_report_date', 'ticker', 'item7',
       'growth', 'y', 'sic', 'sic_letter', 'volatility', 'reaction_positive',
       'reaction_negative', 'volume_adi', 'volume_mfi', 'volatility_atr',
       'trend_sma_slow', 'trend_macd', 'momentum_rsi', 'momentum_roc',
       'momentum_ppo', 'others_dlr', 'others_cr', 'risk_factor_processed',
       'risk_factor_pos', 'risk_factor_neg', 'risk_factor_polarity',
       'risk_factor_subjectivity', 'risk_factor_len', 'item1_pos', 'item1_neg',
       'item1_polarity', 'item1_subjectivity', 'item1_len', 'item7_pos',
       'item7_neg', 'item7_polarity', 'item7_subjectivity',
       'risk_factor_processed', 'item1_processed', 'item7_processed'],
      dtype='object')

In [44]:
dt = dt.drop(columns=['risk_factor', 'item1', '10K_report_date', 'item7', 'sic', 'risk_factor_processed', 'item1_processed', 'item7_processed'])
dt = dt.join(pd.get_dummies(dt.sic_letter,prefix='industry'))
dt = dt.drop(columns=['sic_letter'])
dt

Unnamed: 0,time,ticker,growth,y,volatility,reaction_positive,reaction_negative,volume_adi,volume_mfi,volatility_atr,...,item7_polarity,item7_subjectivity,industry_B,industry_C,industry_D,industry_E,industry_F,industry_G,industry_H,industry_I
0,2006,nwpx,1.175339,2,0.062822,5,6,-3.515938e+05,50.359009,0.792847,...,0.519824,0.313428,0,0,1,0,0,0,0,0
1,2007,nwpx,0.709823,0,0.147784,5,6,-1.440436e+06,57.280553,1.412616,...,0.487179,0.326236,0,0,1,0,0,0,0,0
2,2008,nwpx,0.566531,0,0.180493,3,8,-1.414003e+06,53.844115,2.589620,...,0.490050,0.307105,0,0,1,0,0,0,0,0
3,2010,nwpx,1.331586,2,0.075963,10,1,-4.214262e+06,50.947005,0.922209,...,0.429119,0.321627,0,0,1,0,0,0,0,0
4,2011,nwpx,1.252679,2,0.161627,6,5,-4.860406e+06,49.961210,1.204679,...,0.500000,0.309386,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5390,2013,eqix,1.911720,4,0.256716,5,6,-2.201136e+10,47.417896,40.392765,...,0.382272,0.307397,0,0,0,0,0,0,1,0
5391,2014,eqix,2.184554,4,0.334596,3,8,-2.638618e+10,59.570540,40.465728,...,0.348580,0.306539,0,0,0,0,0,0,1,0
5392,2016,eqix,1.366027,3,0.108364,4,7,-3.219974e+10,49.605018,38.301655,...,0.340843,0.300535,0,0,0,0,0,0,1,0
5393,2017,eqix,1.487546,3,0.093951,3,8,-3.375959e+10,56.892858,38.925276,...,0.341620,0.293370,0,0,0,0,0,0,1,0


In [47]:
dt.columns

Index(['time', 'ticker', 'growth', 'y', 'volatility', 'reaction_positive',
       'reaction_negative', 'volume_adi', 'volume_mfi', 'volatility_atr',
       'trend_sma_slow', 'trend_macd', 'momentum_rsi', 'momentum_roc',
       'momentum_ppo', 'others_dlr', 'others_cr', 'risk_factor_pos',
       'risk_factor_neg', 'risk_factor_polarity', 'risk_factor_subjectivity',
       'risk_factor_len', 'item1_pos', 'item1_neg', 'item1_polarity',
       'item1_subjectivity', 'item1_len', 'item7_pos', 'item7_neg',
       'item7_polarity', 'item7_subjectivity', 'industry_B', 'industry_C',
       'industry_D', 'industry_E', 'industry_F', 'industry_G', 'industry_H',
       'industry_I'],
      dtype='object')

In [46]:
dt.to_csv('./final_v1_dataset.csv')

In [None]:
# Calculating percentage of complex word 
# It is calculated using Percentage of Complex words = the number of complex words / the number of words 
def percentage_complex_word(text):
    ss = token_lemt(text)
    tokens = tokenizer(ss)
    complexWord = 0
    complex_word_percentage = 0
    
    for word in tokens:
        vowels=0
        if word.endswith(('es','ed')):
            pass
        else:
            for w in word:
                if(w=='a' or w=='e' or w=='i' or w=='o' or w=='u'):
                    vowels += 1
            if(vowels > 2):
                complexWord += 1
    if len(tokens) != 0:
        complex_word_percentage = complexWord/len(tokens)
    
    return complex_word_percentage

# Counting complex words
def complex_word_count(text):
    ss = token_lemt(text)
    tokens = tokenizer(ss)
    complexWord = 0
    
    for word in tokens:
        vowels=0
        if word.endswith(('es','ed')):
            pass
        else:
            for w in word:
                if(w=='a' or w=='e' or w=='i' or w=='o' or w=='u'):
                    vowels += 1
            if(vowels > 2):
                complexWord += 1
    return complexWord

# calculating Fog Index 
# Fog index is calculated using -- Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
def fog_index(averageSentenceLength, percentageComplexWord):
    fogIndex = 0.4 * (averageSentenceLength + percentageComplexWord)
    return fogIndex

# %%time
# df['XXX'] = df['XXX'].swifter.apply(token_lemt)

In [None]:
df.columns