# Aggregating Refinement Patterns

This notebook aggregates the manually analyzed data into larger pattern classes.

The input file is the output of the manual analysis step. It should be a CSV with the following columns:
- contract ID
- initial_norm
- norm_refinement
- original_symboleo
- updated_symboleo
- additional_notes
- refinement_pattern

There should be separate CSV files for the construction set and the test set.

## Import File

In [36]:
import csv
import pandas as pd


In [37]:
ROOT_PATH = '/content/drive/MyDrive/Masters/Thesis/contracts/rq3_actual'

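# Input file (uncomment the alternative to switch data sets)
#INPUT_FILENAME = '3_input_norm_set.csv'
INPUT_FILENAME = '3_input_norm_set_test.csv'

# Output file (keep in step with the input file chosen above)
#OUTPUT_FILENAME = '3_output_patterns.csv'
OUTPUT_FILENAME = '3_output_norm_set_test.csv'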

In [38]:
def read_pd_no_header(file_path):
    """Read a header-less CSV and assign the expected column names."""
    names = ['ID', 'initial_norm', 'norm_refinement', 'original_sym', 'updated_sym', 'notes', 'pattern']
    return pd.read_csv(file_path, header=None, names=names)
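
Because `read_pd_no_header` assigns names by position, a file with a different number of columns can be mislabeled without any obvious error. The following is a minimal sketch of a width check, not part of the original analysis: `check_column_count` and the one-row sample are hypothetical, and the sketch assumes the input has no header row.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = ['ID', 'initial_norm', 'norm_refinement', 'original_sym',
                    'updated_sym', 'notes', 'pattern']

def check_column_count(source, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: read a header-less CSV and verify its width."""
    raw = pd.read_csv(source, header=None)
    if raw.shape[1] != len(expected):
        raise ValueError(
            f'expected {len(expected)} columns, found {raw.shape[1]}')
    raw.columns = expected
    return raw

# Made-up row; the quoted field shows that commas inside a norm are handled
sample = io.StringIO('c1,"if a, then b",if a,Happens,conditional,,if EVENT\n')
checked = check_column_count(sample)
```

The same helper accepts a file path in place of the in-memory buffer, since `pd.read_csv` takes either.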


In [40]:
# Read in the data
file_path = f'{ROOT_PATH}/{INPUT_FILENAME}'
df = read_pd_no_header(file_path)

df.head()

| Unnamed: 0 | ID | initial_norm | norm_refinement | original_sym | updated_sym | X | pattern |
|---|---|---|---|---|---|---|---|
| 0 | ON4COMMUNICATIONSINC_07_02_2009-EX-10.1-PROMOT... | upon execution of this agreement by both parti... | upon execution of this agreement by both parties | Happens | conditional | | upon EVENT |
| 1 | BravatekSolutionsInc_20170418_8-K_EX-10.1_1020... | for a registered opportunity where the custome... | within 5 days of receipt of payment (paid when... | Happens | HappensBefore | | within TIMESPAN of EVENT |
| 2 | SOLUTIONSVENDINGINTERNATIONAL,INC_03_31_2020-E... | if you believe that the company has billed you... | no later than 60 days after the closing date o... | Happens | HappensBefore | | no later than TIMESPAN after EVENT |
| 3 | GluMobileInc_20070319_S-1A_EX-10.09_436630_EX-... | (ii) "ice age 2" guarantee and royalty: in add... | on or before ***** | Happens | HappensBefore | | on or before DATE |
| 4 | TICKETSCOMINC_06_22_1999-EX-10.22-SPONSORSHIP ... | upon request, tickets shall have access to per... | upon request | Happens | conditional | | upon EVENT |


## Data Cleaning and Aggregation

In [41]:
# Create a new column that lists the operation from original sym to updated sym
df['operation'] = df['original_sym'] + ' -> ' + df['updated_sym']

# Reduce the dataframe to the pattern and operation
df = df[['operation', 'pattern']]

# Collect the unique operations, in order of first appearance, and print them
all_ops = list(df['operation'].unique())

print(all_ops)

['Happens -> conditional', 'Happens -> HappensBefore', 'Happens -> suspend obligation', 'Happens -> HappensAfter', 'Happens -> HappensWithin', 'termination power -> conditional', 'termination power -> notice to terminate', 'non-obligation -> flip obligation', 'termination power (auto) -> conditional', 'right -> new obligation', 'power to suspend -> power to resume']
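
To make the concatenation step concrete, here is a small sketch on a toy DataFrame. The rows are made up for illustration. It also shows one behavior worth knowing: if either symbol is missing, the resulting `operation` is missing too, so such rows drop out of any later comparison against an operation string.

```python
import pandas as pd

# Made-up rows; the last one has no updated symbol
toy = pd.DataFrame({
    'original_sym': ['Happens', 'Happens', 'Happens', 'right'],
    'updated_sym': ['conditional', 'HappensBefore', 'conditional', None],
    'pattern': ['upon EVENT', 'on or before DATE', 'if EVENT', 'upon EVENT'],
})

# Same expression as in the cell above
toy['operation'] = toy['original_sym'] + ' -> ' + toy['updated_sym']

# Unique operations in order of first appearance, ignoring missing values
toy_ops = list(toy['operation'].dropna().unique())
print(toy_ops)
```

Checking `df['operation'].isna().sum()` on the real data is a quick way to see whether any rows are affected.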


In [42]:
# Given the rows for one operation, counts how often each pattern occurs
# Returns a list of tuples (operation, pattern, count)
COLUMN_NAME = 'pattern'

def get_counts(key, grouping, column_name=COLUMN_NAME):
    counts = grouping.groupby(column_name).size()
    op_range = [key] * len(counts)
    return list(zip(op_range, counts.index, counts.values))

In [43]:
# Iterate through operations and get the counts for each pattern
results = []

for col_name in all_ops:
    grouping = df[df['operation'] == col_name]
    counts = get_counts(col_name, grouping)
    results.extend(counts)
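
As an alternative to the loop above, the same counts can be obtained with a single `groupby` over both columns. This is a sketch on made-up data rather than the notebook's own method. The two approaches give the same tuples but may order them differently: the loop follows the first appearance of each operation, while `groupby` sorts by key by default.

```python
import pandas as pd

# Made-up data in the reduced two-column form
toy = pd.DataFrame({
    'operation': ['Happens -> conditional', 'Happens -> conditional',
                  'Happens -> HappensBefore', 'Happens -> conditional'],
    'pattern': ['upon EVENT', 'if EVENT', 'on or before DATE', 'upon EVENT'],
})

# Loop-style aggregation, mirroring the cells above
loop_results = []
for op in toy['operation'].unique():
    subset = toy[toy['operation'] == op]
    sizes = subset.groupby('pattern').size()
    loop_results.extend(zip([op] * len(sizes), sizes.index, sizes.values))

# Single groupby over (operation, pattern)
grouped = toy.groupby(['operation', 'pattern']).size()
groupby_results = [(op, pat, n) for (op, pat), n in grouped.items()]

# Same content, possibly in a different order
assert sorted(loop_results) == sorted(groupby_results)
```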


## Optional: Automatic pattern classes

The resulting file must still be analyzed manually to identify the pattern classes among the refinement patterns. Part of this process can be automated with a dictionary such as the one below, which maps each refinement pattern to a pattern class. Patterns missing from the dictionary are marked `??`.

In [45]:
pattern_dict = {
    "after EVENT": "CONDITIONAL_A EVENT",
    "after TIMEPOINT": "CONDITIONAL_A EVENT",
    "if EVENT": "CONDITIONAL_A EVENT",
    "in the case of EVENT": "CONDITIONAL_A EVENT",
    "in the event EVENT": "CONDITIONAL_A EVENT",
    "in the event of EVENT": "CONDITIONAL_A EVENT",
    "provided that EVENT": "CONDITIONAL_A EVENT",
    "upon EVENT": "CONDITIONAL_A EVENT",
    "upon termination": "CONDITIONAL_A EVENT",
    "when EVENT": "CONDITIONAL_T EVENT",
    "TIMESPAN later than EVENT": "TIMESPAN P_AFTER_PF EVENT",
    "no later than DATE": "P_BEFORE_S DATE",
    "no later than TIMESPAN after EVENT": "[at least] TIMESPAN P_AFTER_PF EVENT",
    "on EVENT": "CONDITIONAL_A EVENT",
    "on or before DATE": "P_BEFORE_S DATE",
    "on or before TIMEPOINT": "P_BEFORE_S DATE",
    "prior to EVENT": "CONDITIONAL_A EVENT",
    "within TIMESPAN": "within TIMESPAN P_AFTER_W EVENT",
    "within TIMESPAN after EVENT": "within TIMESPAN P_AFTER_W EVENT",
    "within TIMESPAN after TIMEPOINT": "within TIMESPAN P_AFTER_W EVENT",
    "within TIMESPAN from TIMEPOINT": "within TIMESPAN P_AFTER_W EVENT",
    "within TIMESPAN of EVENT": "within TIMESPAN P_AFTER_W EVENT",
    "after TIMESPAN": "after TIMESPAN",
    "unless agreed": "P_EXCEPT EVENT",
    "without EVENT": "P_EXCEPT EVENT",
    "during TERM": "P_DURING TIME_PERIOD",
    "in case EVENT": "CONDITIONAL_A EVENT",
    "in the event that EVENT": "CONDITIONAL_A EVENT",
    "after TIMESPAN from TIMEPOINT": "for TIMESPAN P_AFTER_I EVENT",
    "by EVENT": "CONDITIONAL_A EVENT",
    "by providing TIMESPAN notice": "CONDITIONAL_A EVENT",
    "within TIME PERIOD": "P_DURING TIME_PERIOD"
}

In [46]:
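# Attach a pattern class to each (operation, pattern, count) tuple;
# patterns missing from pattern_dict are marked '??'
new_results = []

for x in results:
    operation, pattern, count = x[:3]
    pattern_class = pattern_dict.get(pattern, '??')
    new_results.append((operation, pattern, count, pattern_class))

results = new_results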
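
After this step it can be useful to list the patterns that were left unmapped, since those are the ones that still need a manual class. A minimal sketch follows, using a one-entry excerpt of the dictionary and made-up result tuples. The names `pattern_dict_sample` and `sample_results` are illustrative only, and the pattern "as soon as EVENT" is invented for the example.

```python
# One-entry excerpt of the dictionary above, for illustration
pattern_dict_sample = {'upon EVENT': 'CONDITIONAL_A EVENT'}

# Made-up (operation, pattern, count) tuples
sample_results = [
    ('Happens -> conditional', 'upon EVENT', 3),
    ('Happens -> conditional', 'as soon as EVENT', 1),
]

# Same labelling rule as the cell above: unknown patterns get '??'
labelled = [(op, pat, n, pattern_dict_sample.get(pat, '??'))
            for op, pat, n in sample_results]

# Distinct patterns that still need a class
unmapped = sorted({pat for _, pat, _, cls in labelled if cls == '??'})
print(unmapped)  # ['as soon as EVENT']
```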


## Save Results

In [47]:
# Output results to a file
result_path = f'{ROOT_PATH}/{OUTPUT_FILENAME}'

with open(result_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(results)
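
The file written above has no header row, so whoever opens it needs to know the column order. If a header is wanted, one option is to write it before the data rows. The sketch below uses an in-memory buffer instead of a real file, and the header names are assumptions, not names used elsewhere in this notebook. It also assumes the optional pattern-class step was run, so each row has four fields.

```python
import csv
import io
import pandas as pd

# Made-up rows in the four-field form produced by the optional step
rows = [
    ('Happens -> conditional', 'upon EVENT', 3, 'CONDITIONAL_A EVENT'),
    ('Happens -> HappensBefore', 'on or before DATE', 1, 'P_BEFORE_S DATE'),
]
header = ['operation', 'pattern', 'count', 'pattern_class']  # assumed names

# newline='' leaves line endings to the csv module, as with open(...) above
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Read the result back to confirm that it round-trips
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Without the optional step each row has only three fields, and the last header name would be dropped.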