This is the development notebook I used to extract the rules into DataFrames and generate the corresponding embeddings. Each PDF from TCEQ needs slight modifications, so there is no way (at least with my skills) to fully automate this process. It's easy enough to spend 20 minutes iterating on the extraction until there are no duplicate rule IDs and the rules look good.

I will not be documenting this notebook any more than it already is. However, feel free to adapt it for other rule sets.

In [49]:
import tiktoken
# Estimate token counts with the encoding used by OpenAI's Ada embedding model. Not perfect, but probably good enough.
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
encoding = tiktoken.get_encoding(embedding_encoding)

In [None]:
chap_name = '395'

In [50]:
with open(f"./raw_data/{chap_name}.txt", "r") as f:
    text = f.read()

In [51]:
out = encoding.encode(text)
len(out)

In [53]:
import re

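def extract_rules(raw_text):
    """Split the raw text into one snippet per section heading ("Sec. 395.xxx")."""
    lines = raw_text.split('\n')
    rule_pattern = re.compile(r'Sec\. 395\.\d+')  # edit the chapter number for other rule sets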
    snippets = []
    current_snippet = ''

    for i, line in enumerate(lines):
        if rule_pattern.match(line.strip()) and (i == 0 or not lines[i - 1].strip()):
            if current_snippet:
                snippets.append(current_snippet)
                current_snippet = ''
        current_snippet += line + '\n'

    if current_snippet:
        snippets.append(current_snippet)

    return snippets

# Extract the rules into standalone snippets
snippets = extract_rules(text)
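
To make the boundary rule concrete: a new snippet starts only at a line that begins with a section heading and follows a blank line (or opens the text), so a cross-reference such as "Sec. 395.002" in the middle of a sentence does not split anything. Below is a minimal sketch of that rule on a made-up sample. `split_sections` and the sample text are hypothetical stand-ins, not the real chapter file.

```python
import re

# Hypothetical sample text; the real input is ./raw_data/<chapter>.txt
sample = (
    "Sec. 395.001.  DEFINITIONS.  In this chapter:\n"
    "\n"
    "(1)  first definition, citing Sec. 395.002.\n"
    "\n"
    "Sec. 395.002.  APPLICABILITY.\n"
)

rule_pattern = re.compile(r"Sec\. 395\.\d+")

def split_sections(raw_text):
    """Start a new snippet at each heading line that follows a blank line."""
    lines = raw_text.split("\n")
    sections, current = [], []
    for i, line in enumerate(lines):
        # match() anchors at the start of the line, so inline citations are ignored
        is_heading = rule_pattern.match(line.strip()) and (i == 0 or not lines[i - 1].strip())
        if is_heading and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

sections = split_sections(sample)
print(len(sections))
```

If the count printed here does not match the number of sections in the source, the heading pattern or the blank-line assumption probably needs adjusting for that PDF.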


In [54]:
# Flatten every section back into individual lines
all_snippets = []
for snip in snippets:
    snips = snip.split('\n')
    all_snippets.extend(snips)
snippets = all_snippets

In [55]:
snippets

['Sec. 395.001.  DEFINITIONS.  In this chapter:',
 '',
 '(1)  "Capital improvement" means any of the following facilities that have a life expectancy of three or more years and are owned and operated by or on behalf of a political subdivision:',
 '',
 '(A)  water supply, treatment, and distribution facilities; wastewater collection and treatment facilities; and storm water, drainage, and flood control facilities; whether or not they are located within the service area; and',
 '',
 '(B)  roadway facilities.',
 '',
 '(2)  "Capital improvements plan" means a plan required by this chapter that identifies capital improvements or facility expansions for which impact fees may be assessed.',
 '',
 '(3)  "Facility expansion" means the expansion of the capacity of an existing facility that serves the same function as an otherwise necessary new capital improvement, in order that the existing facility may serve new development.  The term does not include the repair, maintenance, modernization, or 

In [56]:
def clean_snippets(snippets):
    cleaned_snippets = []

    for i, snippet in enumerate(snippets):
        # Remove unnecessary whitespace at the beginning and end of the snippet
        snippet = snippet.strip()

        # Collapse runs of blank lines into a single blank line
        snippet = re.sub(r'\n\s*\n', '\n\n', snippet)

        # Remove page headers and footers and any other lines that are not part of a rule. Edit as necessary.
        snippet = re.sub(r'\x0cTexas Commission on Environmental Quality Page \d+\s*', '', snippet)
        snippet = re.sub(r'Chapter 217 - Design Criteria for Domestic Wastewater Systems\s*', '', snippet)

        # Blank out amendment history, subchapter headings, and section-range lines
        if snippet.startswith(('Acts', 'Added by Acts', 'SUBCHAPTER', '§§')):
            snippet = ''

        cleaned_snippets.append(snippet)

    return cleaned_snippets

# Clean the extracted snippets
cleaned_snippets = clean_snippets(snippets)


In [57]:
cleaned_snippets

['Sec. 395.001.  DEFINITIONS.  In this chapter:',
 '',
 '(1)  "Capital improvement" means any of the following facilities that have a life expectancy of three or more years and are owned and operated by or on behalf of a political subdivision:',
 '',
 '(A)  water supply, treatment, and distribution facilities; wastewater collection and treatment facilities; and storm water, drainage, and flood control facilities; whether or not they are located within the service area; and',
 '',
 '(B)  roadway facilities.',
 '',
 '(2)  "Capital improvements plan" means a plan required by this chapter that identifies capital improvements or facility expansions for which impact fees may be assessed.',
 '',
 '(3)  "Facility expansion" means the expansion of the capacity of an existing facility that serves the same function as an otherwise necessary new capital improvement, in order that the existing facility may serve new development.  The term does not include the repair, maintenance, modernization, or 

In [58]:
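# Join consecutive non-empty lines into one rule; a blank line ends the current rule.
new_snips = []
rule = ''
for line in cleaned_snippets:
    if line == '':
        new_snips.append(rule.strip())
        rule = ''
    else:
        rule += ' ' + line
# Flush the last rule if the input does not end with a blank line
if rule:
    new_snips.append(rule.strip())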

In [59]:
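# Merge continuation text into the preceding rule. A new rule starts at "(" or "Sec.".
final_snips = []
snips = new_snips
i = 1
rule = snips[0]
while i < len(snips):
    if snips[i] != '':
        if snips[i][0] == '(' or snips[i][0:4] == 'Sec.':  # edit second condition accordingly
            final_snips.append(rule.strip())
            rule = snips[i]
        else:
            rule += ' ' + snips[i]
    i += 1
# Append the last accumulated rule, which the loop itself never flushes
if rule.strip():
    final_snips.append(rule.strip())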


In [60]:
final_snips

['Sec. 395.001.  DEFINITIONS.  In this chapter:',
 '(1)  "Capital improvement" means any of the following facilities that have a life expectancy of three or more years and are owned and operated by or on behalf of a political subdivision:',
 '(A)  water supply, treatment, and distribution facilities; wastewater collection and treatment facilities; and storm water, drainage, and flood control facilities; whether or not they are located within the service area; and',
 '(B)  roadway facilities.',
 '(2)  "Capital improvements plan" means a plan required by this chapter that identifies capital improvements or facility expansions for which impact fees may be assessed.',
 '(3)  "Facility expansion" means the expansion of the capacity of an existing facility that serves the same function as an otherwise necessary new capital improvement, in order that the existing facility may serve new development.  The term does not include the repair, maintenance, modernization, or expansion of an existing 

In [61]:
rules = final_snips

In [62]:
# Split a section heading from a subsection "(a)" that starts on the same line
new_rules = []
for rule in rules:
    if rule.startswith('Sec. 395.'):  # edit this accordingly
        a = rule.find('(a)')
        if a != -1:
            new_rules.append(rule[0:a].strip())
            new_rules.append(rule[a:].strip())
        else:
            new_rules.append(rule)
    else:
        new_rules.append(rule)
rules = new_rules

In [63]:
rules

['Sec. 395.001.  DEFINITIONS.  In this chapter:',
 '(1)  "Capital improvement" means any of the following facilities that have a life expectancy of three or more years and are owned and operated by or on behalf of a political subdivision:',
 '(A)  water supply, treatment, and distribution facilities; wastewater collection and treatment facilities; and storm water, drainage, and flood control facilities; whether or not they are located within the service area; and',
 '(B)  roadway facilities.',
 '(2)  "Capital improvements plan" means a plan required by this chapter that identifies capital improvements or facility expansions for which impact fees may be assessed.',
 '(3)  "Facility expansion" means the expansion of the capacity of an existing facility that serves the same function as an otherwise necessary new capital improvement, in order that the existing facility may serve new development.  The term does not include the repair, maintenance, modernization, or expansion of an existing 

In [64]:
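# Build a hierarchical ID such as 395.014(a)(7)(A) for every rule.
rule_ids = []
chapter = ""     # section number, e.g. "395.014"
subchapter = ""  # subsection letter: (a), (b), ...
number = ""      # (1), (2), ...
letter = ""      # (A), (B), ...
roman = ""       # (i), (ii), ...
roman2 = ""      # (I), (II), ...
# flag / flag2: read an ambiguous marker such as (i) or (I) as a roman numeral, not a letter
flag = flag2 = False
for rule in rules:
    if rule.startswith("Sec."):
        chapter = re.search(r"^Sec\. (\d+\.\d+)", rule).group(1)
        subchapter = letter = number = roman = roman2 = ""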
    elif rule.startswith("("):
        if re.search(r"^\([ivx]+\)", rule):
            if letter != '':
                flag = True
        if re.search(r"^\([IVX]+\)", rule):
            if roman != '':
                flag2 = True
                
        if re.search(r"^\([a-z]\)", rule) and not flag:
            subchapter = re.search(r"^\([a-z]\)", rule).group().replace('(','').replace(')','')
            letter = number = roman = roman2 =""
        elif re.search(r"^\(\d+\)", rule):
            number = re.search(r"^\(\d+\)", rule).group().replace('(','').replace(')','')
            letter = roman = roman2 = ""
        elif re.search(r"^\([A-Z]\)", rule) and not flag2:
            letter = re.search(r"^\([A-Z]\)", rule).group().replace('(','').replace(')','')
            roman = roman2 =""
        elif re.search(r"^\([ivx]+\)", rule):
            roman = re.search(r"^\([ivx]+\)", rule).group().replace('(','').replace(')','')
            flag = False
            roman2 = ''
        if roman != '':
            if re.search(r"^\([IVX]+\)", rule):
                roman2 = re.search(r"^\([IVX]+\)", rule).group().replace('(','').replace(')','')
                flag2 = False

    rule_id = chapter
    if subchapter != "":
        rule_id += "(" + subchapter + ")"
    if number != "":
        rule_id += "(" + number + ")"
    if letter != "":
        rule_id += "(" + letter + ")"
    if roman != "":
        rule_id += "(" + roman + ")"
    if roman2 != "":
        rule_id += "(" + roman2 + ")"

    rule_ids.append(rule_id)
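
The state tracking above can be hard to follow, so here is a simplified sketch of the same idea: keep the current path of markers and truncate it whenever a marker at the same or a shallower level shows up. This is not the cell above. It deliberately ignores the roman-numeral levels, so "(i)" would be read as a subsection letter. `build_rule_ids` and `LEVEL_PATTERNS` are hypothetical names, and the sample rules are placeholders.

```python
import re

# Nesting order assumed here: (a) -> (1) -> (A). Roman numerals are NOT handled.
LEVEL_PATTERNS = [r"[a-z]", r"\d+", r"[A-Z]"]

def build_rule_ids(rules):
    ids, section, path = [], "", []
    for rule in rules:
        heading = re.match(r"Sec\. (\d+\.\d+)", rule)
        marker = re.match(r"\((\w+)\)", rule)
        if heading:
            # A new section resets the whole path
            section, path = heading.group(1), []
        elif marker:
            for level, pattern in enumerate(LEVEL_PATTERNS):
                if re.fullmatch(pattern, marker.group(1)):
                    # Keep only shallower levels, then add this marker.
                    # A level may be skipped, as in 395.001(1).
                    path = [p for p in path if p[0] < level] + [(level, marker.group(1))]
                    break
        ids.append(section + "".join(f"({m})" for _, m in path))
    return ids

sample_rules = [
    "Sec. 395.001.  DEFINITIONS.",
    "(1)  placeholder",
    "(A)  placeholder",
    "(B)  placeholder",
    "(2)  placeholder",
    "Sec. 395.014.  PLAN.",
    "(a)  placeholder",
    "(7)  placeholder",
    "(A)  placeholder",
]
print(build_rule_ids(sample_rules))
```

A rule with no recognizable marker keeps the previous ID, which then shows up as a duplicate. That is usually a sign the extraction steps above need another pass.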


In [65]:
import pandas as pd
df = pd.DataFrame({'rule': rules, 'rule_id': rule_ids})

In [66]:
# Strip the leading marker (everything up to the first whitespace), e.g. "(1)" or "Sec."
df['rule'] = df['rule'].replace(to_replace = r'^.*?\s', value = '', regex = True)

In [67]:
df.set_index('rule_id', inplace=True)

In [68]:
# Check for duplicate rule IDs and iterate on the cells above until there are none.
df[df.index.duplicated()]

rule_id,rule
395.001,395.001. DEFINITIONS. In this chapter:
395.001(1),"""Capital improvement"" means any of the follow..."
395.001(1)(A),"water supply, treatment, and distribution fac..."
395.001(1)(B),roadway facilities.
395.001(2),"""Capital improvements plan"" means a plan requ..."
...,...
395.081(a),This section applies only to a municipality w...
395.081(b),A municipality that has not adopted an impact...
395.081(b)(1),the governing body of the municipality has ad...
395.081(b)(2),the landowner agrees to pay a proportional sh...
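
The same duplicate check can be run on the plain `rule_ids` list before the DataFrame is built, which also shows how many times each ID repeats. This is a small sketch using only the standard library. `find_duplicate_ids` is a hypothetical helper, and the sample IDs are made up.

```python
from collections import Counter

def find_duplicate_ids(rule_ids):
    """Return {rule_id: count} for every ID that appears more than once."""
    counts = Counter(rule_ids)
    return {rid: n for rid, n in counts.items() if n > 1}

# Made-up IDs for illustration
sample_ids = ["395.001", "395.001(1)", "395.001(1)", "395.001(2)"]
print(find_duplicate_ids(sample_ids))
```

An empty dict means every rule got a unique ID and the later `.loc` lookups by rule ID will return a single row.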


In [69]:
def infer_parent_rules(rules_df):  
    for rule_id in rules_df.index:
        if rule_id[-1] == ")":
            parent_rule_id = re.search(r"^(.*)\(", rule_id).group(1)
            rules_df.loc[rule_id, 'parent_rule_id'] = parent_rule_id
        else:
            rules_df.loc[rule_id, 'parent_rule_id'] = ''
    return rules_df

rules_df = infer_parent_rules(df)
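
The parent of a rule is just its ID with the last parenthesized marker removed. Here is a string-only sketch of that step, separate from the DataFrame. `parent_of` is a hypothetical helper name.

```python
def parent_of(rule_id):
    """Drop the last parenthesized marker; a bare section number has no parent."""
    if not rule_id.endswith(")"):
        return ""
    return rule_id[: rule_id.rindex("(")]

for rid in ["395.014(a)(7)(A)", "395.014(a)", "395.014"]:
    print(repr(rid), "->", repr(parent_of(rid)))
```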

In [70]:
def get_rule_embedding(rule_id, rules_df):
    """Build the text to embed for a rule: the rule preceded by all of its parent rules.
    Edit to experiment with other embedding formats."""
    embedding = f"{rule_id}:\n{rules_df.loc[rule_id, 'rule']}"
    parent_rule_id = rules_df.loc[rule_id, 'parent_rule_id']
    
    while parent_rule_id != '':
        parent_rule = rules_df.loc[parent_rule_id, 'rule']
        embedding = f'{parent_rule_id}:\n{parent_rule}\n\n{embedding}'
        parent_rule_id = rules_df.loc[parent_rule_id, 'parent_rule_id']
        
    return embedding
# row.name is the rule_id, since rule_id is the index of df
df['full rule'] = df.apply(lambda row: get_rule_embedding(row.name, df), axis=1)
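
To see the parent walk without a DataFrame, here is a sketch that uses plain dicts: start at the rule, follow parent links up to the section heading, then join the pieces from the top down. `build_context` is a hypothetical name and the rule texts are placeholders, not the real statute text.

```python
def build_context(rule_id, rules, parents):
    """Return the rule preceded by all of its ancestors, outermost first."""
    parts = []
    current = rule_id
    while current != "":
        parts.append(f"{current}:\n{rules[current]}")
        current = parents[current]
    return "\n\n".join(reversed(parts))

# Placeholder data for illustration
rules = {
    "395.014": "section heading",
    "395.014(a)": "subsection text",
    "395.014(a)(7)": "item text",
}
parents = {
    "395.014": "",
    "395.014(a)": "395.014",
    "395.014(a)(7)": "395.014(a)",
}
print(build_context("395.014(a)(7)", rules, parents))
```

Prepending the ancestors gives each short item, such as "roadway facilities.", enough surrounding context to produce a meaningful embedding on its own.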

In [77]:
print(df['full rule'].sample(1).values[0])

395.014:
395.014.  CAPITAL IMPROVEMENTS PLAN.

395.014(a):
 The political subdivision shall use qualified professionals to prepare the capital improvements plan and to calculate the impact fee.  The capital improvements plan must contain specific enumeration of the following items:

395.014(a)(7):
 a plan for awarding:

395.014(a)(7)(A):
 a credit for the portion of ad valorem tax and utility service revenues generated by new service units during the program period that is used for the payment of improvements, including the payment of debt, that are included in the capital improvements plan;  or


In [78]:
from openai.embeddings_utils import get_embedding  # available only in openai<1.0
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

df["embeddings"] = df['full rule'].apply(lambda x: get_embedding(x, engine=embedding_model))
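
`max_tokens` is defined above but never checked. Because each "full rule" carries all of its parents, it may be worth confirming that nothing exceeds the limit before paying for embeddings. Below is a rough sketch with a pluggable counter. `find_over_budget` is a hypothetical helper, and the word-count counter is only a stand-in so the example runs without tiktoken.

```python
def find_over_budget(texts, count_tokens, max_tokens=8000):
    """Return (position, token_count) for each text longer than max_tokens."""
    too_long = []
    for position, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > max_tokens:
            too_long.append((position, n_tokens))
    return too_long

# Stand-in counter for illustration only: whitespace-separated words, not real tokens.
# In this notebook the counter would be: lambda s: len(encoding.encode(s))
word_count = lambda s: len(s.split())
over = find_over_budget(["short rule", "a much longer rule text here"], word_count, max_tokens=3)
print(over)
```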

In [79]:
df.to_csv(f'./raw_data/{chap_name}.csv')

In [80]:
df.to_pickle(f'./rule_pkls/{chap_name}.pkl')

In [82]:
df

rule_id,rule,parent_rule_id,full rule,embeddings,Cluster
395.001,395.001. DEFINITIONS. In this chapter:,,395.001:\n395.001. DEFINITIONS. In this chap...,"[0.0014194968389347196, 0.02503112703561783, 0...",0
395.001(1),"""Capital improvement"" means any of the follow...",395.001,395.001:\n395.001. DEFINITIONS. In this chap...,"[0.001543219550512731, 0.014363299123942852, -...",5
395.001(1)(A),"water supply, treatment, and distribution fac...",395.001(1),395.001:\n395.001. DEFINITIONS. In this chap...,"[0.00212496193125844, 0.009922078810632229, -0...",5
395.001(1)(B),roadway facilities.,395.001(1),395.001:\n395.001. DEFINITIONS. In this chap...,"[0.005122652277350426, 0.01460730005055666, -0...",5
395.001(2),"""Capital improvements plan"" means a plan requ...",395.001,395.001:\n395.001. DEFINITIONS. In this chap...,"[0.006087194196879864, 0.006986063439399004, 0...",5
...,...,...,...,...,...
395.081(a),This section applies only to a municipality w...,395.081,395.081:\n395.081. FEES FOR ADJOINING LANDOWN...,"[0.01068386435508728, 0.03134826570749283, 0.0...",7
395.081(b),A municipality that has not adopted an impact...,395.081,395.081:\n395.081. FEES FOR ADJOINING LANDOWN...,"[-0.0026179684791713953, 0.008856709115207195,...",9
395.081(b)(1),the governing body of the municipality has ad...,395.081(b),395.081:\n395.081. FEES FOR ADJOINING LANDOWN...,"[-0.0026099986862391233, 0.009454330429434776,...",9
395.081(b)(2),the landowner agrees to pay a proportional sh...,395.081(b),395.081:\n395.081. FEES FOR ADJOINING LANDOWN...,"[-0.0028980381321161985, 0.008199654519557953,...",9
