# Measurement Memo 
## Tereza Petrovicova 
### November 5

*In this measurement memo, you will create a measure from your data.  Please submit your R code along with your memo (you do not have to submit your data). Decide on something you'd like to measure within your data that could be achieved by hand coding.*

For this project, I am analyzing regulatory exposure related to energy regulations within companies' annual 10-K filings. These 10-K reports are comprehensive and often contain large sections dedicated to financial details, which would be overwhelming and unnecessary for human coders to read in full. Therefore, I first narrowed my extraction to the most relevant section: Item 1A - Risk Factors. This section is mandated by the SEC and requires companies to disclose any potential risks they face, making it a concentrated source of regulatory information.

After extracting the Risk Factors section, I found that even this section averages around two pages per document. Therefore, I split each Risk Factors section into chunks of 10 sentences. This breakdown should make it easier for coders to work through the text systematically without being overwhelmed by lengthy passages, while still retaining the nuanced context needed to assess regulatory exposure.

## Part A: Creating a Codebook

I am interested in measuring environmental and energy regulatory discourse in 10-Ks. I first try to capture whether any such discourse occurs, and then whether it is possible to disentangle environmental from energy regulation. Finally, I try to see whether the regulation occurs at different levels of government (federal, state, local). I also create a cost category to capture whether the regulation poses a direct cost to the company or is merely described as something that may affect or disrupt business.

| **Variable Name**          | **Description**                                                                                         | **Coding**                                 | **Example Keywords/Indicators**                                     |
|----------------------------|---------------------------------------------------------------------------------------------------------|--------------------------------------------|---------------------------------------------------------------------|
| **Regulation_Present**     | Indicates if there is a discussion of environmental or energy regulation specifically                   | 1 = Yes<br>0 = No                          | "environmental regulation," "energy regulation," "EPA," "FERC"      |
| **Regulation_Type**        | Specifies the type of regulation discussed: environmental, energy, both, or none/unclear                | 1 = Environmental<br>2 = Energy<br>3 = Both<br>0 = None or unclear | "environmental standards," "pollution control," "FERC," "energy grid," "electricity rates" |
| **Regulation_Cost**        | Indicates if regulation is framed as a cost or burden to the company                                    | 1 = Yes<br>0 = No                          | "cost," "compliance burden," "penalty," "increased expenses"        |
| **Federal_Regulation**     | Indicates the presence of federal-level regulation (mentions of agencies like EPA, FERC)                | 1 = Yes<br>0 = No                          | "federal," "EPA," "FERC"                                            |
| **State_Regulation**       | Indicates the presence of state-level regulation discussion                                             | 1 = Yes<br>0 = No                          | "state regulation," "public utility commission," "Renewable Portfolio Standard (RPS)" |
| **Local_Regulation**       | Indicates the presence of local-level regulation discussion                                             | 1 = Yes<br>0 = No                          | "zoning ordinances," "local regulation"                             |



**Note:** If Regulation_Present is 0, then all following variables are also coded 0.
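The rule that a 0 on regulation_present forces a 0 on every later variable can be checked mechanically once the coding is done. What follows is a minimal sketch, assuming each coded chunk is stored as a dictionary keyed by the lower-case variable names used in the reliability code later in this memo. The function name `find_hierarchy_violations` and the two example rows are hypothetical and are not part of my data.

```python
# Sketch: flag coded chunks that break the codebook rule
# "if regulation_present is 0, all following variables are also 0".
# The function name and the example rows are hypothetical.

DEPENDENT_VARS = [
    "regulation_type",
    "regulation_cost",
    "federal_regulation",
    "state_regulation",
    "local_regulation",
]

def find_hierarchy_violations(rows):
    """Return (row_index, variable) pairs where a dependent variable is
    non-zero although regulation_present is coded 0."""
    violations = []
    for i, row in enumerate(rows):
        if row["regulation_present"] == 0:
            for var in DEPENDENT_VARS:
                if row[var] != 0:
                    violations.append((i, var))
    return violations

# Two made-up coded chunks: the second one violates the rule
example_rows = [
    {"regulation_present": 1, "regulation_type": 2, "regulation_cost": 1,
     "federal_regulation": 1, "state_regulation": 0, "local_regulation": 0},
    {"regulation_present": 0, "regulation_type": 0, "regulation_cost": 0,
     "federal_regulation": 0, "state_regulation": 1, "local_regulation": 0},
]

print(find_hierarchy_violations(example_rows))
```

If the codes live in a pandas DataFrame, `df.to_dict("records")` produces rows in this shape. Running such a check on each coder's sheet before computing reliability would catch entry slips that otherwise show up as spurious disagreements.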

### Detailed Explanation of each Category:
1.	**Regulation_Present** -  This binary variable captures the presence of either environmental or energy regulation specifically within the text. For an entry to be coded as "1," there must be an explicit reference to regulatory terms that are clearly related to environmental or energy concerns. General mentions of "governmental regulation" or similar non-specific terms are not included here. Examples of indicators include references to specific agencies or terms like "EPA regulations" or "energy compliance."

2.	**Regulation_Type** - This variable distinguishes the type of regulation being discussed if it can be identified. It’s coded as:
- 1 for Environmental Regulation, which includes mentions of regulations or policies aimed at environmental protection, pollution control, emissions standards, clean water standards, etc.
- 2 for Energy Regulation, which includes policies governing the production, transmission, and distribution of energy, such as FERC rules or electricity rate standards.
- 3 if both types are mentioned together in the text.
- 0 if neither type is mentioned or if the regulation is unclear.

3.	**Regulation_Cost** - This variable indicates if the regulation is framed as imposing a direct cost or burden on the company. The text should clearly indicate that the regulation results in financial expenses, increased operational costs, or requirements for additional resources. Indicators include phrases like "increased cost," "compliance burden," or "cost of raw inputs."

4.	**Federal_Regulation** - This variable captures mentions of federal-level regulation or oversight. Indicators include explicit references to federal agencies such as the Environmental Protection Agency (EPA) or the Federal Energy Regulatory Commission (FERC), or mentions of federal regulations in a way that clarifies the jurisdiction. It’s coded as "1" for presence and "0" otherwise.

5.	**State_Regulation** - This binary variable identifies whether there is a mention of state-level regulatory policies. It captures regulations at the state level, such as Renewable Portfolio Standards (RPS) or state environmental agencies. Mentions of specific state entities, state mandates, or compliance with state laws are key indicators.

6.	**Local_Regulation** - This variable captures mentions of local-level regulations, such as city ordinances or county-level environmental policies. The text should refer explicitly to local jurisdictions, entities, or ordinances. It is coded as "1" if present and "0" if absent.



## Part B: Selecting Training Set 

This section is divided into two parts: pre-processing and selecting 50 chunks of text at random.

### Pre-processing
In this section I pre-process the 10-Ks, which are very long. I therefore focus only on "Item 1A: Risk Factors" and divide each document's Risk Factors section into chunks of 10 sentences.

In [1]:
import pandas as pd
import os

# Set the working directory
os.chdir("/Users/teri/Documents/GitHub/Energy10k")

# Read metadata 
metadata = pd.read_csv("metadata2024.csv")

In [2]:
import re

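# Define a function to extract the "Risk Factors" section and set a flag for missing items
def extract_risk_section(text):
    # Step 1: Search for the "Item 1A. Risk Factors" heading in the document.
    # Note: re.search returns the first match, which may be a cross-reference to
    # Item 1A (e.g. in the table of contents) rather than the section heading itself.
    risk_start = re.search(r'Item\s*1A\s*\.\s*Risk\s*Factors', text, re.IGNORECASE)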
    
    # Initialize a flag for successful extraction
    extraction_successful = True

    # Step 2: If found, slice from that position
    if risk_start:
        start_idx = risk_start.start()
        
        # Try finding the next section (often "Item 1B") to mark the end of the Risk Factors section
        next_item_1b = re.search(r'Item\s*1B\s*\.', text[start_idx:], re.IGNORECASE)
        
        # If "Item 1B" is not found, look for "Item 2"
        if next_item_1b:
            end_idx = start_idx + next_item_1b.start()
        else:
            next_item_2 = re.search(r'Item\s*2\s*\.', text[start_idx:], re.IGNORECASE)
            if next_item_2:
                end_idx = start_idx + next_item_2.start()
            else:
                # If neither "Item 1B" nor "Item 2" is found, mark extraction as
                # unsuccessful and fall back to the rest of the text
                extraction_successful = False
                end_idx = len(text)

        risk_text = text[start_idx:end_idx]
        return risk_text, extraction_successful
    else:
        # Mark extraction as unsuccessful if "Item 1A.RISK FACTORS" section is not found
        return None, False

# Apply the function to the 'text' column and filter out unsuccessful extractions
metadata[['risk_text', 'extraction_successful']] = metadata.apply(
    lambda row: pd.Series(extract_risk_section(row['text'])), axis=1
)

# Drop rows where extraction was unsuccessful
metadata_filtered = metadata[metadata['extraction_successful']]

# Drop the 'extraction_successful' column as it's no longer needed
metadata_filtered = metadata_filtered.drop(columns=['extraction_successful'])

# Display the resulting DataFrame
#metadata_filtered[['accession', 'time', 'ticker', 'naics', 'risk_text']]


In [3]:
# Define a function to count words in a text
def word_count(text):
    if text:
        return len(text.split())

    return 0  # Return 0 if text is None

# Apply the word count function to the 'risk_text' column
metadata['risk_word_count'] = metadata['risk_text'].apply(word_count)

# Display the ticker, word count, and extracted risk text for each entry
print(metadata[['ticker', 'risk_word_count', 'risk_text']])

    ticker  risk_word_count                                          risk_text
0      AEE            10026  ITEM 1A. RISK FACTORS Investors should review ...
1      AEP            11158  ITEM 1A.   RISK FACTORS GENERAL RISKS OF REGUL...
2      AES            12948  ITEM 1A. RISK FACTORS You should consider care...
3      AGR             8158  Item 1A. Risk Factors You should carefully con...
4      ALE             7620  Item 1A. Risk Factors The risks and uncertaint...
..     ...              ...                                                ...
144    VGZ             5462  Item 1A. Risk Factors” below in this annual re...
145    VMC                0                                               None
146    VST            27854  Item 1A. Risk Factors for additional informati...
147    WEC            21072  Item 1A. Risk Factors - Risks Related to Legis...
148    WTI            18131  Item 1A. Risk Factors contained herein for fur...

[149 rows x 3 columns]


In [4]:
# Get the text content for the specified ticker
#awk_text = metadata[metadata['ticker'] == 'AEE']['risk_text'].values[0]

# Display the last 500 characters
#print(awk_text[-500:])
metadata = metadata.drop(columns=['text'])
metadata.to_csv("reg_expo_data.csv", index=False)

In [5]:
## Split into chunks of 10 sentences 
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

# Ensure the punkt tokenizer is downloaded
#nltk.download('punkt')

# Function to split text into chunks of 10 sentences
def split_text_into_chunks(text, chunk_size=10):
    if text is None:
        return []  # Return an empty list if the text is None
    sentences = sent_tokenize(text)  # Split text into sentences
    chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
    return [" ".join(chunk) for chunk in chunks]  # Join sentences back into chunk strings

# Create a new DataFrame to hold the chunked data
chunked_data = []

# Iterate over each row and split 'risk_text' into chunks
for idx, row in metadata.iterrows():
    chunks = split_text_into_chunks(row['risk_text'], 10)
    chunk_dict = {'ticker': row['ticker']}  # Start with ticker
    # Add each chunk to the dictionary
    for i, chunk in enumerate(chunks):
        chunk_dict[f'chunk_{i+1}'] = chunk
    chunked_data.append(chunk_dict)

# Convert the list of dictionaries into a DataFrame
chunked_df = pd.DataFrame(chunked_data)

# Concatenate with original DataFrame, removing 'risk_text' if not needed
metadata_new = pd.concat([metadata.drop(columns=['risk_text']), chunked_df.drop(columns=['ticker'])], axis=1)

# Display the modified DataFrame
#metadata_new.head(20)


In [100]:
# Save Risk sections DataFrame to a CSV file, removing the full text of the 10k
#metadata_new = metadata_new.drop(columns=['risk_text'])
metadata_new.to_csv("reg_expo_data.csv", index=False)

### Selecting 50 Chunks at Random

In the code below, I randomly selected a training set of 50 samples of 10 sentences each.

In [None]:
import pandas as pd
import numpy as np

# Reshape `metadata_new` (one row per company, one column per chunk)
# to have one chunk per row along with the ticker

# Melt the DataFrame to convert all chunk columns into rows
chunk_columns = [col for col in metadata_new.columns if col.startswith('chunk_')]
reshaped_df = metadata_new.melt(id_vars=['ticker'], value_vars=chunk_columns, 
                                     var_name='chunk_number', value_name='chunk_text')

# Drop any rows where 'chunk_text' is NaN (empty chunks)
reshaped_df = reshaped_df.dropna(subset=['chunk_text'])

# Set a seed for reproducibility
seed = 45

# previous seed was 42 - for the sample of 50

# Randomly sample one chunk per company, then draw 130 of those chunks at random
sampled_chunks = reshaped_df.groupby('ticker').sample(n=1, random_state=seed).sample(n=130, random_state=seed)

# Display the sampled chunks
sampled_chunks = sampled_chunks.reset_index(drop=True)
sampled_chunks[['ticker', 'chunk_text']]

# Save into .csv file
sampled_chunks.to_csv("sample130.csv", index=False)



## Part C and D: Human Coding and Intercoder Reliability

I first coded the 50 randomly selected chunks, and a second coder, Harry, did the same. I then combined our results into one dataframe and ran intercoder reliability tests on them. In particular, I created a confusion matrix and calculated Krippendorff's alpha for each variable.

In [134]:

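from sklearn.metrics import confusion_matrix
import krippendorff

# `sample` is the combined DataFrame holding both coders' results,
# with columns named <category>_coder1 and <category>_coder2

# Define the categories to analyze
categories = ['regulation_present','regulation_type', 'regulation_cost', 'federal_regulation', 'state_regulation', 'local_regulation']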

# Loop through each category and calculate the confusion matrix and Krippendorff's alpha
for category in categories:
    coder1 = sample[f'{category}_coder1']
    coder2 = sample[f'{category}_coder2']
    
    # Calculate and display the confusion matrix
    conf_matrix = confusion_matrix(coder1, coder2)
    print(f"Confusion Matrix for {category}:\n", conf_matrix)
    
    # Prepare data for Krippendorff's alpha
    data_for_alpha = [coder1.values, coder2.values]
    
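    # Calculate and display Krippendorff's alpha
    # Note: krippendorff.alpha defaults to level_of_measurement='interval'. This makes
    # no difference for the binary variables, but regulation_type is nominal, so
    # level_of_measurement='nominal' would be the more appropriate setting there.
    alpha = krippendorff.alpha(reliability_data=data_for_alpha)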
    print(f"Krippendorff's alpha for {category}:", alpha)
    
    # Display mismatches
    mismatches = sample[coder1 != coder2]
    print(f"Mismatches between coders for {category}:\n", mismatches[['ticker', f'{category}_coder1', f'{category}_coder2']])
    print("\n" + "-"*50 + "\n")




Confusion Matrix for regulation_present:
 [[27  0]
 [ 6 17]]
Krippendorff's alpha for regulation_present: 0.7525
Mismatches between coders for regulation_present:
    ticker  regulation_present_coder1  regulation_present_coder2
0     MGY                          1                          0
3     EOG                          1                          0
20   SMLP                          1                          0
27   LBRT                          1                          0
33    RRC                          1                          0
49    CPK                          1                          0

--------------------------------------------------

Confusion Matrix for regulation_type:
 [[28  0  0  0]
 [ 3  6  1  0]
 [ 1  3  3  0]
 [ 1  3  0  1]]
Krippendorff's alpha for regulation_type: 0.6115743011280039
Mismatches between coders for regulation_type:
    ticker  regulation_type_coder1  regulation_type_coder2
0     MGY                       1                       0
3     EOG 


### Discussion of results

The results reveal a pattern in my coding approach: I tended to over-code instances compared to Coder 2, with very few cases where Coder 2 identified an instance that I missed. This suggests a systematic difference in our interpretations, likely due to my broader reading of the categories.
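The direction of disagreement can be read off the confusion matrices printed above. The snippet below is a small sketch of that reading, assuming rows correspond to Coder 1 (me) and columns to Coder 2, with code 0 ("absent") in the first row and column. This is how `confusion_matrix(coder1, coder2)` lays out its result. The helper name `directional_disagreement` is hypothetical.

```python
# Sketch: count "present vs. absent" disagreements in each direction.
# Assumes rows = Coder 1, columns = Coder 2, and that code 0 sits in
# row/column 0, as in the confusion matrices printed above.

def directional_disagreement(conf_matrix):
    """Return (coder1_only, coder2_only): the number of chunks where only
    Coder 1, or only Coder 2, assigned a non-zero code."""
    coder1_only = sum(row[0] for row in conf_matrix[1:])  # column 0, below the diagonal
    coder2_only = sum(conf_matrix[0][1:])                 # row 0, right of the diagonal
    return coder1_only, coder2_only

# Matrices copied from the output above
regulation_present = [[27, 0],
                      [6, 17]]
regulation_type = [[28, 0, 0, 0],
                   [3, 6, 1, 0],
                   [1, 3, 3, 0],
                   [1, 3, 0, 1]]

print(directional_disagreement(regulation_present))  # (6, 0)
print(directional_disagreement(regulation_type))     # (5, 0)
```

For both variables, every disagreement about presence involves a chunk that I coded as non-zero and Coder 2 coded as 0, and none go the other way.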


#### regulation_present

For the **regulation_present** category, we achieved the highest Krippendorff's alpha, 0.75, which is still not especially high. This makes sense, as this is the broadest category, serving as the foundation for the more specific sub-categories. If we disagree at this foundational level, disagreements in the subsequent sub-categories are expected by design. For example, there was a case where Coder 2 didn’t consider the regulation to pertain to the environment, while I did. That decision affected our coding in the related columns for the federal, state, and local levels, highlighting how initial disagreements can cascade into multiple coding differences. In most other categories Krippendorff's alpha was about 0.6, which is quite low, but this too is partly a consequence of how the categories are constructed.
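To make the alpha values less of a black box, the calculation can be reproduced by hand from a confusion matrix. The sketch below covers only the simplest case (two coders, nominal codes, no missing values) and is not the `krippendorff` package's implementation. The function name `nominal_alpha_from_confusion` is hypothetical.

```python
# Sketch: Krippendorff's alpha for two coders, nominal data, no missing
# values, computed directly from a confusion matrix (rows = Coder 1,
# columns = Coder 2). The function name is hypothetical.

def nominal_alpha_from_confusion(conf_matrix):
    k = len(conf_matrix)
    n_units = sum(sum(row) for row in conf_matrix)
    n_values = 2 * n_units  # each chunk contributes two codes

    # Coincidence matrix: every chunk is counted once per ordered coder pair
    coincidence = [[conf_matrix[c][d] + conf_matrix[d][c] for d in range(k)]
                   for c in range(k)]
    totals = [sum(row) for row in coincidence]

    # Observed disagreement: share of coincidences that are mismatches
    observed = sum(coincidence[c][d]
                   for c in range(k) for d in range(k) if c != d) / n_values
    # Expected disagreement if codes were paired at random
    expected = sum(totals[c] * totals[d]
                   for c in range(k) for d in range(k) if c != d) / (n_values * (n_values - 1))
    return 1 - observed / expected

# Matrices copied from the output above
alpha_present = nominal_alpha_from_confusion([[27, 0], [6, 17]])
alpha_type = nominal_alpha_from_confusion(
    [[28, 0, 0, 0], [3, 6, 1, 0], [1, 3, 3, 0], [1, 3, 0, 1]]
)
print(round(alpha_present, 4))  # 0.7525
print(round(alpha_type, 4))
```

For regulation_present this reproduces the reported 0.7525: 12 of 100 coincidences are mismatches, against an expected 4800/9900. For regulation_type, the nominal value comes out somewhat below the 0.61 printed above. That is consistent with the package treating the codes 0 to 3 as interval-scaled by default, so the nominal figure is arguably the better summary for that variable.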

Upon examining discrepancies, I identified six instances where I coded "1" and Coder 2 coded "0." In three of these cases, the environmental regulation was less prominent in the text, possibly buried among other information. In the remaining three cases, the differences were more nuanced. For example, two of these cases involved hydraulic fracturing. One instance discussed permitting and leasing practices, while the other focused on health risks associated with handling hydraulic fracture sand. In the first case, the regulation was phrased as: "Acting Secretary for the Department of the Interior signed an order effectively suspending new fossil fuel leasing and permitting on federal lands." I interpreted this as an environmental/energy regulation, as it directly relates to energy practices on public lands, and I think it should be coded as "1". However, in the other instance, the text discussed "the actual or perceived health risks of handling hydraulic fracture sand," which my peer did not classify as environmental regulation. Upon further discussion, we agreed that this second regulation, while affecting energy firms, stems from health and safety concerns rather than environmental ones.

Another interesting discrepancy involved an SEC rule requiring companies to report environmental risks. This was challenging, but ultimately, I decided to code it as "0" since it does not alter firms’ operational practices but rather their reporting requirements. The key takeaway here is the importance of discerning the intent behind the regulation rather than simply identifying sector-related keywords. If a regulation impacts energy firms purely as a byproduct of their industry (such as health and safety), it may not fit under regulation_present unless environmental or energy concerns are explicitly the focus.

Given these observations, I plan to refine my codebook definition for regulation_present to be:

 - **regulation_present:** *This binary variable captures the presence of either environmental or energy regulation specifically within the text. It must be clear that the regulation is motivated by environmental or energy concerns, rather than general regulatory impacts that incidentally affect energy or environmental firms.*

This refined definition should help reduce ambiguity and improve coder alignment by focusing on the regulatory intention rather than incidental industry effects.


#### regulation_cost

This category turned out to be somewhat redundant and poorly defined. In essence, any mention of regulatory risk can imply a potential cost to the firm, given the nature of 10-K risk sections. For example, we did not reach a clear consensus on whether implicit costs such as "delays" should be coded as 1 or 0. My intention with regulation_cost was to replicate a category used in Sautner et al. (2023), a dataset I am trying to adapt for 10-Ks instead of earnings calls. However, in 10-K filings, nearly every mention of environmental regulation poses a potential or implied cost, making this category less useful as a distinctive measure. Additionally, my confusion matrix is 3x3 due to a typo ("2" instead of "1" or "0").

#### regulation_type
This category proved to be challenging because environmental and energy regulations are often discussed together. Differentiating between the two can be difficult, as they are interlinked (e.g., emissions restrictions often impact energy production). The distinction became murky in cases where the regulation talked about both emissions limitations and restrictions on oil and gas production.

In retrospect, it might be more insightful to remove **regulation_type** as a hand-coded category and instead use a topic modeling approach on the text chunks that include any form of environmental or energy regulation. This approach could reveal natural clusters of topics and identify distinguishing words, providing a more nuanced understanding of regulatory themes than manual coding could achieve.

If I were to keep this category, a refined codebook definition might specify:

*Environmental Regulation: Focuses on controlling or reducing emissions, pollutants, and environmental impact.*

*Energy Regulation: Pertains to rules affecting energy production, transmission, and extraction processes (e.g., hydraulic fracturing regulations, grid management policies), or moderating energy cost, availability, etc.*

#### Federal, State, and Local Regulation Levels
For the levels of regulation (federal_regulation, state_regulation, local_regulation), most discrepancies arose from cases where we didn’t initially agree on whether the regulation was environmental or energy-related. Excluding those instances, the primary difference was that the second coder occasionally missed mentions of "state regulation" or "local regulation" within the text, rather than a fundamental disagreement on whether it applied. This underscores the importance of thorough and detailed coding for levels of government, especially for local ordinances, which can be subtle.

#### Reflection on Coding Process

This exercise highlighted some practical challenges with manual coding. Human coders can experience fatigue, which may impact attention and accuracy, especially when coding lengthy text chunks. I initially broke the text into 10-sentence chunks, but I now realize that shorter chunks (e.g., 5 sentences) might be more manageable. However, there’s a trade-off: shorter chunks risk losing context that spans multiple sentences, which could be crucial for identifying environmental or energy regulation mentions.

Additionally, the value of knowledgeable coders became apparent. Recognizing specific agencies, acronyms, and regulatory bodies (e.g., distinguishing federal agencies from state/local entities) requires familiarity with the subject matter. Although my codebook covered some of this, coders sometimes had to research terms to ensure accuracy. This experience underlines the importance of using well-prepared coders who can apply domain knowledge effectively for accurate results.

