# Jury Instructions Similarity Project
The purpose of this program is to measure the similarity between jury instructions. The "base" jury instructions that other instructions are compared against are the "Civil Pattern Jury Instructions" from the Georgia Superior Court, provided by the Internet Archive.

In [1]:
# Make sure we have the needed libraries. You only need to run these once
#!pip install bs4
#!pip install html5lib

<p> 
    <i> BeautifulSoup and html5lib were installed to parse the JI html file. 
    <hr>

In [2]:
import re
import pandas as pd
import spacy
from scipy import spatial 
nlp = spacy.load("en_core_web_lg") #this can take a minute or two to load

<p> 
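    <i> The re module (regular expressions, part of the Python standard library) is imported to organize the section heads and their corresponding text from the html file. Pandas places the jury instructions, their section heads, and their vectors into a dataframe. spaCy vectorizes the text, and scipy supplies the cosine distance used later to compare vectors. 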
    <hr>

In [3]:
from bs4 import BeautifulSoup
with open("GAJI.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

<p> 
    <i> Here, BeautifulSoup is parsing the html file. 
    <hr>

In [4]:
content = soup.find_all("div", class_ = 'ocr_carea')
len(content)

3054

<p> 
    <i> The html file that contains the JI is formatted as one long body with minimal delineating tags. On the first pass I tried using the page numbers as a delimiter, but that did not work because the sections run over multiple pages. "ocr_carea" worked nicely because it is the closest thing to a paragraph/line tag that I could find. There are 3054 such blocks in the entire document. 
    <hr>

In [5]:
results = []

for result in content: 
    result = result.text
    result = result.replace("\n", " ")
    results.append(result)


results = results[114:2959]
results = (''.join(results))

<p> 
    <i> The html format was messy and filled with blank spaces, which made it difficult to parse and organize the text. I created a for loop to clean up the results by replacing each line break with a single space. I also sliced the document down to just the relevant content and joined the pieces into one string. 
    <hr>
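
As a minimal sketch of the cleanup step above, the snippet below runs the same three operations on a short made-up list. The sample strings and the slice bounds are illustrative assumptions, not values taken from GAJI.html.

```python
# Hypothetical OCR blocks standing in for the text of each "ocr_carea" div.
sample_blocks = [
    "Table of\nContents ",
    "00.010 Generally\nMembers of the jury ",
    "00.020 Evidence\nYou determine the facts ",
    "Index of\nCitations ",
]

# Replace every line break with a single space, as the loop above does.
cleaned = [block.replace("\n", " ") for block in sample_blocks]

# Keep only the blocks that hold the instructions (illustrative bounds),
# then join them into one long string for the regex steps that follow.
body = "".join(cleaned[1:3])
print(body)
```

In the notebook the bounds 114 and 2959 play the role of the illustrative 1 and 3 here.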

In [6]:
clean_sections_one = re.sub(r"Preliminary General Instructions", "", results)
clean_sections_two = re.sub(r"\d Updated", "", clean_sections_one)
clean_sections_three = re.sub(r"January \d{4}", "", clean_sections_two)
clean_sections_four = re.sub(r"General Instructions Updated", "", clean_sections_three)
clean_sections_five = re.sub(r"Updated July 2014", "", clean_sections_four)
clean_sections_six = re.sub(r"Updated August 2019", "", clean_sections_five)
clean_sections_seven = re.sub(r"Updated August 2020", "", clean_sections_six)
clean_sections_eight = re.sub(r"July \d{4}", "", clean_sections_seven)
sec_results = re.split(r"(?<![a-zA-Z]{2}) \d{2}\.\d{3}", clean_sections_eight)  # raw string so \d and \. reach the regex engine unchanged
print(type(sec_results))
print(len(sec_results))

<class 'list'>
694


<p> 
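    <i> Regular expressions are used to isolate the content of each section. Running headers and "Updated" date stamps are removed first. The text is then split at every section head, which follows the numeric pattern xx.xxx. The negative lookbehind skips any number that comes right after two letters and a space. The type and length of sec_results are printed as a check. A split on n section heads returns n + 1 pieces, the first being the text before the first head, so the length here is one more than the number of section heads. 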
        <hr>
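
The effect of the negative lookbehind is easier to see on a small example. The sketch below reuses the pattern from the cell above on a made-up sentence; the sample text is an assumption chosen so that one number is a cross-reference rather than a section head.

```python
import re

# Same pattern as the notebook cell: a space, then a number shaped like xx.xxx,
# unless that space comes immediately after two letters.
SECTION_HEAD = r"(?<![a-zA-Z]{2}) \d{2}\.\d{3}"

# Made-up sample text; "See 10.100" is a cross-reference, not a section head.
sample = "Intro. 10.100 Duty of care. See 10.100 for details. 10.200 Damages."

pieces = re.split(SECTION_HEAD, sample)
heads = re.findall(SECTION_HEAD, sample)

print(heads)   # the two real section heads, each with its leading space
print(pieces)  # one more piece than heads: the text before the first head comes first
```

The number after "See" stays inside the second piece because the two letters "ee" before the space block the match.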

In [7]:
section_list = re.findall(r"(?<![a-zA-Z]{2}) \d{2}\.\d{3}", results)  # raw string so \d and \. reach the regex engine unchanged
print(type(section_list))
print(len(section_list))

<class 'list'>
693


<p>
    <i> Using "findall" instead of split, the actual section heads are being isolated using the same regex shown above. 
        <hr>
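
Because the heads and the section texts come from two separate calls, it helps to confirm that they line up before combining them. The following is a minimal sketch on made-up text; the dictionary at the end is an illustrative way to pair them and is not what the notebook itself builds.

```python
import re

SECTION_HEAD = r"(?<![a-zA-Z]{2}) \d{2}\.\d{3}"

# Made-up sample text with two section heads.
sample = (
    "Front matter. 00.010 Generally Members of the jury. "
    "00.020 Evidence You determine the facts."
)

# findall returns each head with its leading space, so strip it off.
heads = [head.strip() for head in re.findall(SECTION_HEAD, sample)]

# split returns the text before the first head as element 0; drop it.
bodies = re.split(SECTION_HEAD, sample)[1:]

# The two lists should line up one-to-one before they are combined.
assert len(heads) == len(bodies)

sections = {head: body.strip() for head, body in zip(heads, bodies)}
print(sections)
```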

In [8]:
#results_clean = []
#for string in sec_results:
    #if (string != ""):
        #results_clean.append(string)
#print(results_clean)

In [9]:
#type(results_clean)

In [10]:
doc = sec_results[1:815]

<p>
    <i> Slices sec_results to just the text of the jury instructions, omitting the leading material before the first section head. Since the list has 694 entries, the upper bound of 815 does not cut anything from the end. 
        <hr>

In [11]:
# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0.
# "vector" starts out empty (None) and is filled in by vectorize() below.
docdf = pd.DataFrame({"section": section_list, "text": doc, "vector": None})
#docdf["similar instructions"] = "temp"

<p>
    <i> Pandas (as pd) creates the dataframe. Columns are defined as "section" for the section heads, "text" for the text of each section, and "vector" for the nlp vectorized text. A "similar instructions" placeholder column is commented out; it will likely be replaced by the similarity function defined below. 
        <hr> 

In [12]:
def vectorize(row):
    return nlp(row["text"]).vector

<p> 
    <i> This function vectorizes the "text" column of one dataframe row: spaCy processes the text and returns the vector for the whole document. 
        <hr>

In [13]:
docdf

Unnamed: 0,section,text,vector
0,00.000,Evidence; Note Regarding Changes Based upon N...,
1,00.005,(Civil): Before Introduction of Evidence ...,
2,00.010,"Generally Members of the jury, the case ...",
3,00.020,Evidence You determine the facts from the...,
4,00.030,Parties (Optional) You must consider this c...,
...,...,...,...
688,78.000,YEAR’S SUPPORT,
689,78.010,Year’s Support; Entitlement The surviving...,
690,78.020,Year’s Support; Amount of Award If object...,
691,78.030,Year’s Support; Burden of Proof The petitio...,


<p> 
    <i> Initial pass through of the dataframe. Vector function has not been applied. 
        <hr>

In [14]:
docdf["vector"] = docdf.apply(vectorize, axis=1)

<p>
    <i> Applies the vectorize function to each row of the dataframe and stores the result in the "vector" column. 
        <hr>

In [15]:
docdf

Unnamed: 0,section,text,vector
0,00.000,Evidence; Note Regarding Changes Based upon N...,"[0.011602934, 0.11262337, -0.074103616, -0.079..."
1,00.005,(Civil): Before Introduction of Evidence ...,"[-0.021564402, 0.12485398, -0.062207088, -0.05..."
2,00.010,"Generally Members of the jury, the case ...","[-0.017002761, 0.18788356, -0.15679309, -0.019..."
3,00.020,Evidence You determine the facts from the...,"[-0.029937955, 0.12117469, -0.15202114, -0.046..."
4,00.030,Parties (Optional) You must consider this c...,"[-0.021735033, 0.08105373, -0.13502479, -0.060..."
...,...,...,...
688,78.000,YEAR’S SUPPORT,"[-0.0579694, 0.045098, 0.0048918007, 0.1186379..."
689,78.010,Year’s Support; Entitlement The surviving...,"[-0.030390836, 0.11698262, -0.14236122, -0.026..."
690,78.020,Year’s Support; Amount of Award If object...,"[-0.018070256, 0.10453738, -0.12511279, -0.035..."
691,78.030,Year’s Support; Burden of Proof The petitio...,"[-0.058484677, 0.09231084, -0.059771817, -0.02..."


<p>
    <i> Completed dataframe. 
        <hr>

In [17]:
with open("KAJI.html") as fp2:
    soup2 = BeautifulSoup(fp2, 'html.parser')
KAcontent = soup2.find_all("div", class_ = 'ocr_carea')

KAresults = []
for result in KAcontent: 
    result = result.text
    result = result.replace("\n", " ")
    KAresults.append(result)
KAresults = KAresults[756:7880]
KAresults = (''.join(KAresults))

KA_trimmed = re.sub(r"(\d{3}\.\d{2} Pattern Instructions Kansas \d)", "", KAresults)
trimmed_text = re.sub(r"Pattern Instructions Kansas \d", "", KA_trimmed)
headings_list = re.findall(r"(\d{3}\.\d{2})\s*[A-Z]{2}", trimmed_text)  # raw string, like the other patterns in this cell
section_starts = re.findall(r"(\d{3}\.\d{2}\s*)([A-Z]{2})", trimmed_text)
to_split = re.sub(r"(\d{3}\.\d{2}\s*)[A-Z]{2}", "---ABCDEFG1234---", trimmed_text)

clean_text = re.sub(r"\d{3}\.\d{2}", "", to_split)
cleaner_text = re.sub(r"\d{3}\-\d{2}", "", clean_text)
cleanest_text = re.sub(r"\d{3}\-\d ", "", cleaner_text)

section_ends = re.split(r"---ABCDEFG1234---", cleanest_text)
trimmed_section_ends = section_ends[1:]

section_text = []
for section_index in range(len(section_starts)):
    current_section_start_tuple = section_starts[section_index]
    current_first_two_letters = current_section_start_tuple[1]
    current_end = trimmed_section_ends[section_index]
    combined = current_first_two_letters + current_end
    section_text.append(combined)
    
# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0.
# "vector" starts out empty (None) and is filled in by vectorize() below.
KAdf = pd.DataFrame({"section": headings_list, "text": section_text, "vector": None})

def vectorize(row):
    return nlp(row["text"]).vector

<p>
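    <i> A second dataframe is created with the Kansas jury instructions. It follows the same approach as the GA jury instructions, just consolidated into one cell. Kansas section heads follow the pattern xxx.xx and are followed by an upper-case title, so the first two title letters are captured with each head and added back to the start of the section text after the split. 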
        <hr>
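
The placeholder technique used in the Kansas cell can be hard to follow in one long cell, so here it is as a minimal sketch on made-up text. The sample string and the marker name are assumptions; the notebook uses its own marker and also strips leftover numbers afterwards, which this sketch skips.

```python
import re

# Made-up sample in the Kansas layout: xxx.xx followed by an upper-case title.
sample = (
    "101.01 JURY HANDBOOK The Committee recommends. See 101.02 below. "
    "101.02 HANDBOOK FOR JURORS Text."
)

HEAD = r"(\d{3}\.\d{2}\s*)([A-Z]{2})"
MARKER = "---SPLIT---"  # hypothetical placeholder; any string absent from the text works

matches = re.findall(HEAD, sample)
heads = [number.strip() for number, _ in matches]
first_letters = [letters for _, letters in matches]

# Replace each head (number plus the two consumed title letters) with the marker,
# then split on the marker and drop the empty piece before the first head.
marked = re.sub(HEAD, MARKER, sample)
bodies = marked.split(MARKER)[1:]

# Put the two consumed title letters back at the front of each section.
sections = [letters + body for letters, body in zip(first_letters, bodies)]
print(heads)
print(sections)
```

Requiring two capital letters after the number is what keeps the cross-reference "See 101.02 below" from being treated as a section head.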

In [18]:
KAdf

Unnamed: 0,section,text,vector
0,101.01,JURY HANDBOOK USAGE RECOMMENDED The Commi...,
1,101.02,HANDBOOK FOR JURORS I. The Importance of ...,
2,101.03,IMPORTANCE OF JURY SERVICE Comment T...,
3,101.04,METHOD OF SELECTING JURY PANEL Comment ...,
4,101.05,EXCUSES FROM JURY DUTY Comment This ...,
...,...,...,...
473,191.01,AUTO ACCIDENT Factual Summary Plaint...,
474,191.02,"AUTO ACCIDENT—TWO DRIVERS, TWO PASSENGERS ...",
475,191.11,MEDICAL MALPRACTICE—TWO DEFENDANTS— LOSS OF ...,
476,191.21,"CONTRACT, FRAUD Factual Allegations ...",


<p>
    <i> Initial pass through of the dataframe. Vector function has not been applied.
        <hr>

In [19]:
KAdf["vector"] = KAdf.apply(vectorize, axis=1)

<p>
    <i> Applies the vectorize function to each row of the dataframe and stores the result in the "vector" column.
        <hr>

In [20]:
KAdf

Unnamed: 0,section,text,vector
0,101.01,JURY HANDBOOK USAGE RECOMMENDED The Commi...,"[-0.0071784426, 0.16388734, -0.040732857, -0.0..."
1,101.02,HANDBOOK FOR JURORS I. The Importance of ...,"[-0.026811205, 0.11125102, -0.13919875, -0.023..."
2,101.03,IMPORTANCE OF JURY SERVICE Comment T...,"[-0.060275327, 0.22410202, -0.01242239, -0.055..."
3,101.04,METHOD OF SELECTING JURY PANEL Comment ...,"[-0.05365543, 0.23475248, -0.012920184, -0.042..."
4,101.05,EXCUSES FROM JURY DUTY Comment This ...,"[-0.07529171, 0.19577427, -0.02056773, -0.0601..."
...,...,...,...
473,191.01,AUTO ACCIDENT Factual Summary Plaint...,"[-0.06553959, 0.16846004, -0.14060146, -0.0485..."
474,191.02,"AUTO ACCIDENT—TWO DRIVERS, TWO PASSENGERS ...","[-0.05621617, 0.16104051, -0.14128214, -0.0509..."
475,191.11,MEDICAL MALPRACTICE—TWO DEFENDANTS— LOSS OF ...,"[-0.066119775, 0.16307037, -0.14596342, -0.031..."
476,191.21,"CONTRACT, FRAUD Factual Allegations ...","[-0.064156264, 0.14843899, -0.14877954, -0.071..."


<p>
    <i> Completed KA dataframe
        <hr>

In [21]:
source_text = "Juror Use of Electronic Technology to Conduct"

def find_dist(row,source_text):
    v1 = nlp(source_text).vector
    v2 = row
    dist = 1-spatial.distance.cosine(v1,v2)
    return dist

def find_similar(df,source_text,n=5,sim=.90):
    df["similarity"] = df["vector"].apply(find_dist, args=(source_text,))
    df_similarity = (df.sort_values(by=["similarity"], ascending=False))
    return(df_similarity[df_similarity["similarity"]>=sim][:n])

<p>
    <i> "source_text" is the text that is compared against the dataframe. Within the "find_dist" function, v1 is the vector of source_text, v2 is the stored vector from one row of a given dataframe, and dist is their cosine similarity, computed as 1 minus the cosine distance that scipy returns. 
        The "find_similar" function adds a "similarity" column to a given dataframe that holds this score for each row. It passes "source_text" through to find_dist, sorts the rows by similarity in descending order, keeps only rows with a similarity of at least .90 (or 90% similar) by default, and returns at most 5 of them by default.
        <hr>
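
For readers who want to see what the similarity score measures, the sketch below computes cosine similarity by hand and then applies the same sort, threshold, and cut that find_similar uses. It is an illustration in plain Python with tiny made-up vectors; the function names cosine_similarity and top_matches are hypothetical and do not appear in the notebook.

```python
import math

def cosine_similarity(v1, v2):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def top_matches(source_vector, rows, n=5, sim=0.90):
    """rows is a list of (section, vector) pairs.

    Scores every row, sorts from most to least similar, keeps scores of at
    least `sim`, and returns at most `n` results.
    """
    scored = [(section, cosine_similarity(source_vector, vec)) for section, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [pair for pair in scored if pair[1] >= sim][:n]

# Tiny made-up vectors; the spaCy vectors in the dataframe are much longer.
rows = [("A", [1.0, 0.0]), ("B", [1.0, 1.0]), ("C", [0.0, 1.0]), ("D", [2.0, 0.1])]
matches = top_matches([1.0, 0.0], rows, n=5, sim=0.90)
print(matches)  # only "A" and "D" point in nearly the same direction as the source
```

Cosine similarity depends on direction and not on length, which is why "D" scores close to 1 even though it is about twice as long as the source vector.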

In [22]:
find_similar(docdf, source_text, sim=.90)

Unnamed: 0,section,text,vector,similarity
184,0.11,Juror Use of Electronic Technology END O...,"[-0.099581, 0.19737351, 0.059162498, 0.0451945...",0.941348


<p> 
    <i> "find_similar" is passing the Georgia JI dataframe, the source text, and setting the minimum similarity to .90. Here, there is only one section of the Georgia JI that is at least 90% similar to the source text. 
        <hr>

In [36]:
html_string = """<html>
<head>
    <style>
        .content {
            margin:15px auto 0 auto;
            padding:0 15px;
            max-width:650px;
        }
        a.anchor {
            display:block;
            float:right;
            width:22px;
            height:16px;
            background-image: url("https://suffolklitlab.org/images/link.png");
            background-repeat: no-repeat;
            background-position: center;
        }
    </style>
</head>
<body>
    <div class="content">
    <h1>Kansas Jury Instructions w/ AI Annotations</h1>
    <p>[Preamble describing the nature of this annotated presentation]</p>
"""
for index, item in KAdf.iterrows():
    df_section = item["section"]
    df_text = item["text"]
    #sim_results = find_similar(docdf, df_text)
    html_string = html_string + "<hr><h4><a class=\"anchor\" name='"+df_section+"' href='#"+df_section+"'></a>"+df_section+"</h4><p>"+df_text
    #if len(sim_results)>0:
        #html_string = html_string + "<p>Similar sections in Georgia JI: "
        #for index_2, row in sim_results.iterrows():
            #html_string = html_string + "<a href='Georgia.html#"+row["section"]+"'>"+row["section"]+"</a> ("+str(round(row["similarity"]*100))+"%); "
        #html_string = html_string + "</p>"
html_string = html_string + """
    <hr>
    <p style="text-align:center;margin-bottom:150px;">
        The code used to produce this can be found on <a href="https://github.com/SuffolkLITLab/Jury-Instructions/blob/main/web/Jury%20Instruction%20Similarity%20Project.ipynb" target="_blank">GitHub</a>.
    </p>
    </div>
</body>
</html>
"""
with open("Kansas.html","w") as file: 
    file.write(html_string)

<p> 
    <i> HTML construction. The for loop iterates over the "section" and "text" columns of the KA jury instruction dataframe, so the HTML shows each section head above the text of that section. An empty a-tag named after the section number is added to each head. It gives every section a link target, and the commented-out lines would let the KA jury instructions serve as the base in a different comparison. 
        <hr>
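
The section text is placed into the page as-is, so any "<" or "&" left over from OCR would be read as markup. One possible refinement, sketched here under that assumption, is to escape the text and collect the pieces in a list before joining. The helper name render_section and the sample rows are hypothetical.

```python
from html import escape

def render_section(section, text):
    """Return the HTML for one section: an anchor, a heading, and the escaped text."""
    sec = escape(section)
    return (
        f'<hr><h4><a class="anchor" name="{sec}" href="#{sec}"></a>{sec}</h4>'
        f"<p>{escape(text)}</p>"
    )

# Made-up rows standing in for the "section" and "text" columns.
rows = [
    ("101.01", "JURY HANDBOOK <draft> & notes"),
    ("101.02", "HANDBOOK FOR JURORS"),
]

# Collect the parts in a list and join once instead of growing one string in a loop.
body = "\n".join(render_section(section, text) for section, text in rows)
print(body)
```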

In [45]:
html_string = """<html>
<head>
    <style>
        .content {
            margin:15px auto 0 auto;
            padding:0 15px;
            max-width:650px;
        }
        a.anchor {
            display:block;
            float:right;
            width:22px;
            height:16px;
            background-image: url("https://suffolklitlab.org/images/link.png");
            background-repeat: no-repeat;
            background-position: center;
        }
    </style>
</head>
<body>
    <div class="content">
    <h1>Georgia Jury Instructions w/ AI Annotations</h1>
    <p>[Preamble describing the nature of this annotated presentation]</p>
"""
for index, item in docdf[:3].iterrows(): # you can comment out the line below and use this one to test with only a few rows
#for index, item in docdf.iterrows():
    df_section = item["section"]
    df_text = item["text"]
    sim_results = find_similar(KAdf, df_text)
    html_string = html_string + "<hr><h4><a class=\"anchor\" name='"+df_section+"' href='#"+df_section+"'></a>"+df_section+"</h4><p>"+df_text
    if len(sim_results)>0:
        html_string = html_string + "<p>Similar sections in Kansas JI: "
        for index_2, row in sim_results.iterrows():
            html_string = html_string + "<a href='Kansas.html#"+row["section"]+"'>"+row["section"]+"</a> ("+str(round(row["similarity"]*100))+"%); "
        html_string = html_string + "</p>"
html_string = html_string + """
    <hr>
    <p style="text-align:center;margin-bottom:40px;">
        The code used to produce this can be found on <a href="https://github.com/SuffolkLITLab/Jury-Instructions/blob/main/web/Jury%20Instruction%20Similarity%20Project.ipynb" target="_blank">GitHub</a>.
    </p>
    </div>
</body>
</html>
"""
with open("Georgia.html","w") as file: 
    file.write(html_string)

<p> 
    <i> Creates the html using the information from the dataframe. As in the KA HTML above, the for loop reads the "section" and "text" columns of the GA dataframe. The find_similar function measures the similarity between each Georgia jury instruction and the Kansas jury instructions. Where similar KA sections are found, a-tags linking to them are placed below the section text of the GA jury instruction. Because every section of the GA jury instructions is compared against the entire KA dataframe, the full run takes a long time. Slicing docdf before calling iterrows, as the active line does with docdf[:3], limits the run to a few sections and reduces processing time. 
        <hr>
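
One likely contributor to the long run time is that find_dist vectorizes the source text again for every Kansas row. This has not been measured here, so treat it as an assumption. The sketch below shows the idea of vectorizing each Georgia section once and reusing that vector. The embed function is a hypothetical stand-in for nlp(text).vector that only counts letters, and the sample texts are made up.

```python
import math

calls = {"embed": 0}

def embed(text):
    """Hypothetical stand-in for nlp(text).vector; counts two letters and tracks calls."""
    calls["embed"] += 1
    return [text.count("a") + 1.0, text.count("e") + 1.0]

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

georgia_texts = ["care and damages", "evidence"]

# The Kansas vectors are computed once up front, like the "vector" column of KAdf.
kansas_vectors = [embed(t) for t in ["negligence", "damages", "expert evidence"]]
calls["embed"] = 0  # count only the calls made during the comparison

all_scores = []
for text in georgia_texts:
    source_vector = embed(text)  # one embedding per Georgia section
    scores = [cosine_similarity(source_vector, kv) for kv in kansas_vectors]
    all_scores.append(scores)

print(calls["embed"])  # one call per Georgia section, not one per pair
```

Since docdf already stores a vector for every Georgia section, the same idea could go one step further and reuse that column directly.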