# EDA of the relationship between stock price direction and the content of 10K / 10Q reports

## 1. Problem and key areas of the analysis

**Forms 10K and 10Q** are among the most important reporting forms for publicly traded companies in USA:

* Form 10-K is an annual comprehensive report filed by companies with the SEC, detailing their financial performance and business operations. It includes audited financial statements, management's discussion and analysis, risk factors, and other important disclosures.

* Form 10-Q contains financial statements, management discussion and analysis, disclosures, and internal controls for the previous quarter. Companies must file their 10-Qs 40 or 45 days after the end of their quarters, depending on the size of their public float.

**Dataset.** Analysis was performed using the next data:
* Loughran-McDonald SEC/EDGAR dataset
* Cleaned files from the stage one parsing process
* Time period: 

This **feature analysis** involves next areas: 
* Full content of 10K/10Q form 
* Section of Management discussion and analysis (MDA)
* Risk Factors section (if applicable, principally for 10K form)

The **target variable** is defined as follows:
* Three gradations of stock price direction: up (1), down (-1), no change (0)
* Boundaries for the "no change" category: no more than 1% up / down, no more than 2% up / down
* To determine the value of the target variable, price changes were used on the following horizons: 3 days, 10 days, 30 days

If the quotation day fell on a day off, the first earlier day (for the previous day) or the first following working day (for the next day) was taken.

# 2. Feature and targets generation

In [33]:
import pandas as pd
import re
from joblib import Parallel, delayed

N_CORES = -1 # -1: all avaliable cores

df = pd.read_csv('/Users/pavel/Documents/NLP_stock_model/nlp-stock-model/data/final_data.csv', index_col=0)

# function for extracting text items from report

def clean_text(text):
    cleaned_text = re.sub(r'\s+', ' ', re.sub(r'\n', ' ', text)).strip()
    return cleaned_text

df['full_content'] = Parallel(n_jobs=N_CORES)(delayed(clean_text)(x) for x in df['full_content'])
df['full_content_length'] = Parallel(n_jobs=N_CORES)(delayed(len)(x) for x in df['full_content'])

df = df.dropna(subset=['close_day_before'])

In [34]:
# function for extracting text items from report
def extract_matching_text(text, pattern):
    matches = pattern.findall(text)
    if matches:
        return ' '.join(matches)
    return None 

# Extracting Management discussion and analysis and length calculation

pattern_mda = re.compile(r'ITEM.{,20}MANAGEMENT.{,10}DISCUSSION.{,10}ANALYSIS.{,10}OF.{,10}FINANCIAL(.*?)ITEM.{0,3}\d.{0,4}QUANTITATIVE', re.IGNORECASE)
df['MDA'] = Parallel(n_jobs=N_CORES)(delayed(extract_matching_text)(x, pattern_mda) for x in df['full_content'])
df['MDA_lenght'] = df['MDA'].apply(lambda x: None if pd.isna(x) else len)


# Extracting Quantitative and Qualitative Disclosures About Market Risk and length calculation

pattern_market = re.compile(r'ITEM.{,20}QUANTITATIVE.{,10}AND.{,10}QUALITATIVE.{,10}DISCLOSURES.{,10}ABOUT(.*?)ITEM.{0,3}\d.{0,4}CONTROLS', re.IGNORECASE)
df['MARKET_RISK'] = Parallel(n_jobs=N_CORES)(delayed(extract_matching_text)(x, pattern_market) for x in df['full_content'])
df['MARKET_RISK_length'] = df['MARKET_RISK'].apply(lambda x: None if pd.isna(x) else len)


# Extracting Risk Factors and length calculation

pattern_risks = re.compile(r'ITEM.{,20}RISK.{,10}FACTORS.{,10}(.*?)ITEM.{0,3}\d.{0,4}UNREGISTERED', re.IGNORECASE)
df['RISK_FACTORS'] = Parallel(n_jobs=N_CORES)(delayed(extract_matching_text)(x, pattern_risks) for x in df['full_content'])
df['RISK_FACTORS_length'] = df['RISK_FACTORS'].apply(lambda x: None if pd.isna(x) else len)

In [35]:
# Target generation

def calculate_target(row, period):
    price_diff = row[f'close_day_after_{period}'] / row['close_day_before']
    return 1 if price_diff > 1 else (-1 if price_diff < 1 else 0)

def calculate_target_2_percent(row, period):
    price_diff = row[f'close_day_after_{period}'] / row['close_day_before'] - 1
    return 1 if price_diff > 0.02 else (-1 if price_diff < -0.02 else 0)

def generate_targets(df):
    df['target_3'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target)(row, 3) for _, row in df.iterrows())
    df['target_10'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target)(row, 10) for _, row in df.iterrows())
    df['target_30'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target)(row, 30) for _, row in df.iterrows())
    df['target_3_2p'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target_2_percent)(row, 3) for _, row in df.iterrows())
    df['target_10_2p'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target_2_percent)(row, 10) for _, row in df.iterrows())
    df['target_30_2p'] = Parallel(n_jobs=N_CORES)(delayed(calculate_target_2_percent)(row, 30) for _, row in df.iterrows())
    return df


df = generate_targets(df)

In [37]:
df.head(15)

Unnamed: 0,ticker,publication_date,report_date,report_type,full_content,close_day_before,close_day_after_1,close_day_after_3,close_day_after_10,close_day_after_30,...,MARKET_RISK,MARKET_RISK_length,RISK_FACTORS,RISK_FACTORS_length,target_3,target_10,target_30,target_3_2p,target_10_2p,target_30_2p
0,NKE,2004-01-12,2003-11-30,10-Q,<Header> <FileStats> <FileName>20040112_10-Q_e...,67.87,69.1,69.36,69.58,72.36,...,MARKET RISK There have been no material chang...,<built-in function len>,,,1,1,1,1,1,1
1,CTAS,2004-01-14,2003-11-30,10-Q,<Header> <FileStats> <FileName>20040114_10-Q_e...,45.55,45.75,45.38,46.03,44.22,...,Market Risk. MARKET RISK. In our normal ope...,<built-in function len>,,,0,1,0,0,0,-1
2,STZ,2004-01-14,2003-11-30,10-Q,<Header> <FileStats> <FileName>20040114_10-Q_e...,32.56,32.63,32.96,32.9,35.41,...,MARKET RISK - ------- -----------------------...,<built-in function len>,,,1,1,1,0,0,1
3,MCK,2004-01-29,2003-12-31,10-Q,<Header> <FileStats> <FileName>20040129_10-Q_e...,29.0,29.38,29.31,28.8,27.82,...,,,,,1,0,0,0,0,-1
4,SNPS,2004-01-29,2003-10-31,10-K,<Header> <FileStats> <FileName>20040129_10-K_e...,35.26,35.29,35.52,36.35,29.65,...,Market Risk 46 Item 8. Financial Statements a...,<built-in function len>,,,1,1,0,0,1,-1
5,PH,2004-01-30,2003-12-31,10-Q,<Header> <FileStats> <FileName>20040130_10-Q_e...,54.93,54.65,53.94,56.21,56.82,...,,,,,0,1,1,0,1,1
6,PG,2004-02-03,2003-12-31,10-Q,<Header> <FileStats> <FileName>20040203_10-Q_e...,101.68,103.2,102.55,103.12,101.76,...,,,,,1,1,1,0,0,0
7,ADP,2004-02-03,2003-12-31,10-Q,<Header> <FileStats> <FileName>20040203_10-Q_e...,43.4,42.68,43.48,44.5,43.19,...,,,,,1,1,0,0,1,0
8,SBUX,2004-02-04,2003-12-28,10-Q,<Header> <FileStats> <FileName>20040204_10-Q_e...,36.47,36.76,36.83,38.75,37.46,...,Market Risk Market Risk 16 Market Risk Fo...,<built-in function len>,,,1,1,1,0,1,1
9,ADBE,2004-02-05,2003-11-28,10-K,<Header> <FileStats> <FileName>20040205_10-K_e...,37.92,38.92,38.88,39.19,35.83,...,Market Risk 57 Item 8. Financial Statements a...,<built-in function len>,,,1,1,0,1,1,-1
