# EDA of the relationship between stock price direction and the content of 10-K / 10-Q reports

## 1. Problem and key areas of the analysis

**Forms 10-K and 10-Q** are among the most important reporting forms for publicly traded companies in the USA:

* Form 10-K is an annual comprehensive report filed by companies with the SEC, detailing their financial performance and business operations. It includes audited financial statements, management's discussion and analysis, risk factors, and other important disclosures.

* Form 10-Q is a quarterly report filed with the SEC. It contains financial statements, management's discussion and analysis, disclosures, and information on internal controls for the previous quarter. Companies must file their 10-Qs within 40 or 45 days after the end of the quarter, depending on the size of their public float.

**Dataset.** The analysis uses the following data:
* Loughran-McDonald SEC/EDGAR dataset
* Cleaned files from the stage one parsing process
* Time period: 

The **feature analysis** covers the following areas:
* Full content of the 10-K/10-Q form
* Management's Discussion and Analysis (MDA) section
* Risk Factors section (if applicable, principally for the 10-K form)

The **target variable** is defined as follows:
* Three classes of stock price direction: up (1), down (-1), no change (0)
* Two alternative boundaries for the "no change" class: a price change of no more than 1% in either direction, or no more than 2% in either direction
* The target is computed from price changes over three horizons: 3 days, 10 days, 30 days
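
The labelling rule described above could be sketched as follows. This is only an illustration under stated assumptions: the helper `label_direction`, the column names `price_t0` and `price_t3`, and the sample prices are hypothetical, and the notebook does not state how prices were sourced or whether horizons are counted in trading or calendar days.

```python
import numpy as np
import pandas as pd

def label_direction(price_start, price_end, threshold=0.01):
    """Label a relative price move as 1 (up), -1 (down) or 0 (no change)."""
    change = (price_end - price_start) / price_start
    # moves of at most `threshold` in either direction count as "no change"
    return np.where(change > threshold, 1, np.where(change < -threshold, -1, 0))

# hypothetical prices at the filing date and 3 days later
prices = pd.DataFrame({
    'price_t0': [100.0, 100.0, 100.0, 100.0],
    'price_t3': [101.5, 103.0, 97.0, 100.2],
})

# one target column per "no change" boundary (1% and 2%)
prices['target_3d_1pct'] = label_direction(prices['price_t0'], prices['price_t3'], 0.01)
prices['target_3d_2pct'] = label_direction(prices['price_t0'], prices['price_t3'], 0.02)
print(prices)
```

A 1.5% rise is labelled "up" under the 1% boundary but "no change" under the 2% boundary, which is why the two boundaries can produce different class balances.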

## 2. Feature generation and target calculation

In [None]:
import pandas as pd
import re

df = pd.read_csv('data/final_data.csv', index_col=0)

# collapse line breaks and runs of whitespace into single spaces
df['full_content'] = df['full_content'].str.replace(r'\s+', ' ', regex=True)

# calculate length of reports
df['full_content_length'] = df['full_content'].apply(len)

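# function for extracting a section of the report captured by a regex pattern
def extract_matching_text(text, pattern):
    """Return all captured matches joined by a space, or None if nothing matches."""
    matches = re.findall(pattern, text, re.IGNORECASE)
    if matches:
        return ' '.join(matches)
    return None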
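
To make the extraction logic concrete, here is a small sketch that applies the MDA pattern used in this notebook to a hypothetical, heavily shortened filing text (the sample string is invented for illustration and is not taken from the dataset). With a single capturing group, `re.findall` returns only the captured text between the two item headings.

```python
import re

# hypothetical, shortened filing text (not from the dataset)
sample = (
    "Item 7. Management's Discussion and Analysis of Financial Condition "
    "and Results of Operations Revenue grew during the year. "
    "Item 7A. Quantitative and Qualitative Disclosures About Market Risk "
    "We are exposed to interest rate risk. "
    "Item 8. Financial Statements"
)

pattern_mda = (
    r'ITEM.{,20}MANAGEMENT.{,10}DISCUSSION.{,10}ANALYSIS.{,10}OF.{,10}FINANCIAL'
    r'(.*?)ITEM.{0,3}\d.{0,4}QUANTITATIVE'
)

# the lazy group (.*?) stops at the first following "Item <n> ... Quantitative" heading
matches = re.findall(pattern_mda, sample, re.IGNORECASE)
section = ' '.join(matches) if matches else None
print(section)
```

If a report repeats the same headings elsewhere (for example in a table of contents), each occurrence may produce its own match, and all matches are joined into one string.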

In [None]:
# Extracting Management discussion and analysis and length calculation

pattern_mda = r'ITEM.{,20}MANAGEMENT.{,10}DISCUSSION.{,10}ANALYSIS.{,10}OF.{,10}FINANCIAL(.*?)ITEM.{0,3}\d.{0,4}QUANTITATIVE'
df['MDA'] = df['full_content'].apply(lambda x: extract_matching_text(x, pattern_mda))
df['MDA_lenght'] = df['MDA'].str.len()  # NaN when no MDA section was matched


# Extracting Quantitative and Qualitative Disclosures About Market Risk and length calculation

pattern_market = r'ITEM.{,20}QUANTITATIVE.{,10}AND.{,10}QUALITATIVE.{,10}DISCLOSURES.{,10}ABOUT(.*?)ITEM.{0,3}\d.{0,4}CONTROLS'
df['MARKET_RISK'] = df['full_content'].apply(lambda x: extract_matching_text(x, pattern_market))
df['MARKET_RISK_length'] = df['MARKET_RISK'].str.len()  # NaN when no section was matched


# Extracting Risk Factors and length calculation

pattern_risks = r'ITEM.{,20}RISK.{,10}FACTORS.{,10}(.*?)ITEM.{0,3}\d.{0,4}UNREGISTERED'
df['RISK_FACTORS'] = df['full_content'].apply(lambda x: extract_matching_text(x, pattern_risks))
df['RISK_FACTORS_length'] = df['RISK_FACTORS'].str.len()  # NaN when no section was matched
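
Because the patterns depend on specific item headings, some reports may yield no match for a given section. A quick coverage check along the following lines may help before using the extracted columns as features. The frame `toy` and its values are hypothetical stand-ins for the extracted columns.

```python
import pandas as pd

# hypothetical extraction results; None marks reports where a pattern found nothing
toy = pd.DataFrame({
    'MDA': ['text a', None, 'text c', 'text d'],
    'RISK_FACTORS': [None, None, 'risk text', None],
})

# share of reports in which each section was found
coverage = toy[['MDA', 'RISK_FACTORS']].notna().mean()
print(coverage)

# .str.len() leaves missing sections as NaN instead of raising an error
mda_length = toy['MDA'].str.len()
print(mda_length)
```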