# Lexical Reduction

A hash-map-based technique that reduces sequence datasets in CSV format to the lexical (integer-token) format used by SPMF (https://www.philippe-fournier-viger.com/spmf). Each distinct column/value pair is mapped to its own integer item ID, which makes it possible to apply sequential pattern and rule mining approaches to non-text data.

In [1]:
import pandas as pd
import json

In [2]:
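# Load the data as a pandas DataFrame
df = pd.read_csv('../Data Sources/Cleaned/US_RCC_Database_CLEANED.csv', skipinitialspace=True)
# Let pandas pick more specific dtypes for object columns where it can
df = df.infer_objects()
# Drop the identifier and free-text columns, which carry no pattern information
df = df.drop(columns=['Database ID', 'COMMENTS'])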

In [3]:
# Exclude non-categorical features (TODO: bin these instead of dropping them)
df = df.select_dtypes(exclude = ['float64', 'datetime64[ns]'])
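
One possible way to address the TODO above is to bin the float columns into labelled intervals before encoding, so they survive as categorical features. The sketch below is only an illustration: the helper `bin_numeric_columns`, the choice of three equal-width bins, the `bin0`/`bin1`/... labels, and the toy column names are all assumptions, not part of the original pipeline.

```python
import pandas as pd

def bin_numeric_columns(frame, n_bins=3):
    """Return a copy of `frame` with float64 columns replaced by bin labels.

    Hypothetical helper: equal-width bins via pd.cut, labelled bin0..bin{n-1}.
    """
    binned = frame.copy()
    labels = [f"bin{k}" for k in range(n_bins)]
    for col in binned.select_dtypes(include=["float64"]).columns:
        # pd.cut splits the observed value range into n_bins equal-width intervals
        binned[col] = pd.cut(binned[col], bins=n_bins, labels=labels).astype(str)
    return binned

# Toy data (made up for illustration)
toy = pd.DataFrame({"size_cm": [1.0, 5.0, 9.0], "grade": ["G1", "G2", "G3"]})
binned_toy = bin_numeric_columns(toy)
print(binned_toy)
```

Equal-width bins are the simplest choice; quantile-based bins (`pd.qcut`) may suit skewed columns better, and datetime columns would need their own treatment.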

In [4]:
# Create (or truncate) the output file; it is closed once encoding is done
output = open("encoded_dataset.txt", "w")

In [5]:
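# Maps each "column@value" key to a unique integer item ID (stored as a string)
token_dict = {}
i = 0  # next unused item ID

for sample, row in df.iterrows():
    token_seq = []
    for index, value in row.items():
        # Qualify the value with its column name so equal values in
        # different columns get different item IDs
        key = index + "@" + str(value)
        if key not in token_dict:
            token_dict[key] = str(i)
            i += 1
        token_seq.append(token_dict[key])
    # SPMF sequence format: "-1" closes each itemset, "-2" closes the sequence.
    # Joining with " -1 " after appending "-2" yields e.g. "0 -1 1 -1 -2".
    token_seq.append("-2")
    output.write(" -1 ".join(token_seq))
    output.write("\n")

output.close()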
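
To make the output format concrete, here is a small self-contained sketch of the same encoding applied to a toy table. The function name `encode_rows` and the toy columns are made up for illustration; the logic mirrors the loop above but collects lines in a list rather than writing to a file.

```python
import pandas as pd

def encode_rows(frame):
    """Encode each row as one SPMF sequence line; return (lines, mapping)."""
    mapping = {}
    lines = []
    for _, row in frame.iterrows():
        tokens = []
        for column, value in row.items():
            key = f"{column}@{value}"
            if key not in mapping:
                # IDs are assigned in order of first appearance
                mapping[key] = str(len(mapping))
            tokens.append(mapping[key])
        tokens.append("-2")  # end-of-sequence marker
        lines.append(" -1 ".join(tokens))  # "-1" closes each itemset
    return lines, mapping

# Toy data (made up for illustration)
toy = pd.DataFrame({"stage": ["I", "II", "I"], "grade": ["G1", "G1", "G2"]})
lines, mapping = encode_rows(toy)
for line in lines:
    print(line)
# 0 -1 1 -1 -2
# 2 -1 1 -1 -2
# 0 -1 3 -1 -2
```

Here `stage@I` receives ID 0 and is reused in the third row, while `grade@G1` (ID 1) is shared by the first two rows.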

In [6]:
# Save the token mapping so item IDs can be decoded back to "column@value" keys later
with open("token_mapping.json", "w") as file:
    json.dump(token_dict, file)
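
Going the other way, the saved mapping can be inverted to turn a line of item IDs back into column/value pairs. The following is a minimal sketch under a few assumptions: `decode_line` is a hypothetical helper, column names are assumed not to contain `@`, and anything from the first token starting with `#` onward is ignored on the assumption that mined-pattern lines may carry annotations (such as support counts) there. The sample mapping is inlined so the block runs on its own; in practice it would be loaded from `token_mapping.json` with `json.load`.

```python
def decode_line(line, id_to_key):
    """Translate one SPMF-style line into a list of (column, value) pairs."""
    decoded = []
    for token in line.split():
        if token.startswith("#"):
            break  # assumed annotation section; stop decoding here
        if token in ("-1", "-2"):
            continue  # itemset and sequence separators carry no item
        # Split on the first "@" only, so values may themselves contain "@"
        column, value = id_to_key[token].split("@", 1)
        decoded.append((column, value))
    return decoded

# Sample mapping in the same shape as token_dict (made up for illustration)
sample_mapping = {"stage@I": "0", "grade@G1": "1", "stage@II": "2"}
# Invert it: item ID -> "column@value"
id_to_key = {item_id: key for key, item_id in sample_mapping.items()}

decoded = decode_line("2 -1 1 -1 -2", id_to_key)
print(decoded)
# [('stage', 'II'), ('grade', 'G1')]
```

Note that values come back as strings, since the encoder applied `str()` to every value before building the key.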