# Create the Corpus Problem Lists

This notebook creates the known, unknown and problem document lists for a given corpus and data type. The lists are used to iterate over the documents in the corpus when creating the jobscripts that pull the known and unknown documents. To run it on a new dataset, change the **corpus** and **data_type** parameters.

In [205]:
import sys
import pandas as pd

from from_root import from_root

sys.path.insert(0, str(from_root("src")))

from utils import get_base_location, build_metadata_df, apply_temp_doc_id
from read_and_write_docs import read_jsonl, read_rds

In [206]:
corpus      = "Yelp"
data_type   = "test"

# Resolve the NAS base path so the notebook also runs on a Windows laptop
nas_base_loc = get_base_location()

known_loc = f"{nas_base_loc}/datasets/author_verification/{data_type}/{corpus}/known_raw.jsonl"
unknown_loc = f"{nas_base_loc}/datasets/author_verification/{data_type}/{corpus}/unknown_raw.jsonl"
metadata_loc = f"{nas_base_loc}/datasets/author_verification/{data_type}/metadata.rds"

save_loc = f"{nas_base_loc}/datasets/author_verification/{data_type}/{corpus}"
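
Because only `corpus` and `data_type` change between runs, the path construction above could be wrapped in a small helper. The following is a minimal sketch, not part of the project: `corpus_paths` is a hypothetical name, `/mnt/nas` is a placeholder base location, and the directory layout is simply the one used in the cell above.

```python
from pathlib import PurePosixPath


def corpus_paths(base_loc: str, data_type: str, corpus: str) -> dict:
    """Build the input and output locations for one corpus/data_type pair.

    Hypothetical helper: it mirrors the f-strings used in the notebook cell.
    """
    data_dir = PurePosixPath(base_loc) / "datasets" / "author_verification" / data_type
    corpus_dir = data_dir / corpus
    return {
        "known": str(corpus_dir / "known_raw.jsonl"),
        "unknown": str(corpus_dir / "unknown_raw.jsonl"),
        "metadata": str(data_dir / "metadata.rds"),
        "save": str(corpus_dir),
    }


# "/mnt/nas" is a placeholder; the notebook gets the real value from get_base_location()
paths = corpus_paths("/mnt/nas", "test", "Yelp")
print(paths["known"])
```

Keeping the layout in one place means a new corpus or data type only needs different arguments rather than edits to four separate strings.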

## Read Data

In [207]:
metadata = read_rds(metadata_loc)
filtered_metadata = metadata[metadata['corpus'] == corpus]

known = read_jsonl(known_loc)
unknown = read_jsonl(unknown_loc)

## Create Metadata

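This step is fairly convoluted. Judging by the renames in the cell below, `apply_temp_doc_id` reads a column named `orig_doc_id` and writes its result to `doc_id`, so the known and unknown ID columns are each renamed to `orig_doc_id`, converted, and renamed back in turn.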

In [208]:
# Build the dataframe
complete_metadata = build_metadata_df(filtered_metadata, known, unknown)

# Add an empty text column, which apply_temp_doc_id needs in order to run
complete_metadata['text'] = ''

# Rename the known column and create the new doc_id
complete_metadata.rename(columns={"known_doc_id": "orig_doc_id"}, inplace=True)
complete_metadata = apply_temp_doc_id(complete_metadata)
complete_metadata.rename(columns={
    "orig_doc_id": "orig_known_doc_id",
    "doc_id": "known_doc_id",
    "unknown_doc_id": "orig_doc_id"
}, inplace=True)

# Do the same for the unknown
complete_metadata = apply_temp_doc_id(complete_metadata)
complete_metadata.rename(columns={
    "orig_doc_id": "orig_unknown_doc_id",
    "doc_id": "unknown_doc_id",
}, inplace=True)

# Keep only the required columns, in a fixed order
complete_metadata = complete_metadata[["sample_id", "problem", "corpus", "known_doc_id", "unknown_doc_id"]]
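
The two rename/apply rounds above follow one pattern: move the column to be converted into `orig_doc_id`, call the helper, then move `doc_id` back under the original name. A minimal sketch of that pattern on toy data follows. `fake_temp_doc_id` is a hypothetical stand-in for `apply_temp_doc_id`, whose implementation is not shown here. It assumes the helper reads `orig_doc_id` and writes `doc_id`, and its lowercase-and-underscore normalization is only a guess for illustration.

```python
import re

import pandas as pd


def fake_temp_doc_id(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: reads 'orig_doc_id', writes a normalized 'doc_id'."""
    out = df.copy()
    out["doc_id"] = out["orig_doc_id"].map(
        lambda s: re.sub(r"[^a-z0-9]+", "_", s.lower())
    )
    return out


def convert_id_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace `column` with its converted ID, keeping the original as orig_<column>."""
    df = df.rename(columns={column: "orig_doc_id"})
    df = fake_temp_doc_id(df)
    return df.rename(columns={"orig_doc_id": f"orig_{column}", "doc_id": column})


# Toy data; the IDs are invented for the example
toy = pd.DataFrame({
    "known_doc_id": ["A-1.txt", "A-2.txt"],
    "unknown_doc_id": ["B-1.txt", "B-1.txt"],
})
for col in ("known_doc_id", "unknown_doc_id"):
    toy = convert_id_column(toy, col)

print(toy[["known_doc_id", "unknown_doc_id"]])
```

Wrapping the rename dance in one function removes the need to track which column currently holds the name `orig_doc_id` between the two rounds.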

## View the data

In [209]:
complete_metadata.head()

| | sample_id | problem | corpus | known_doc_id | unknown_doc_id |
|---|---|---|---|---|---|
| 0 | 1 | --kedvpjB1PT28X_gArafA vs --kedvpjB1PT28X_gArafA | Yelp | `_kedvpjb1pt28x_garafa_review_14_2_8_2008_` | `_kedvpjb1pt28x_garafa_review_20_4_update_10_2008_` |
| 1 | 2 | --kedvpjB1PT28X_gArafA vs --kedvpjB1PT28X_gArafA | Yelp | `_kedvpjb1pt28x_garafa_review_15_2_8_2008_` | `_kedvpjb1pt28x_garafa_review_20_4_update_10_2008_` |
| 2 | 3 | --kedvpjB1PT28X_gArafA vs --kedvpjB1PT28X_gArafA | Yelp | `_kedvpjb1pt28x_garafa_review_19_10_8_2008_` | `_kedvpjb1pt28x_garafa_review_20_4_update_10_2008_` |
| 3 | 4 | --kedvpjB1PT28X_gArafA vs --kedvpjB1PT28X_gArafA | Yelp | `_kedvpjb1pt28x_garafa_review_2_2_8_2008_` | `_kedvpjb1pt28x_garafa_review_20_4_update_10_2008_` |
| 4 | 5 | --kedvpjB1PT28X_gArafA vs --kedvpjB1PT28X_gArafA | Yelp | `_kedvpjb1pt28x_garafa_review_22_26_2_2011_` | `_kedvpjb1pt28x_garafa_review_20_4_update_10_2008_` |


## Create the Document Lists

In [210]:
# Column selections are already Series, so no pd.Series wrapper is needed
known_doc_list = complete_metadata["known_doc_id"].astype(str)
unknown_doc_list = complete_metadata["unknown_doc_id"].astype(str)
problem_doc_list = known_doc_list + ' vs ' + unknown_doc_list

## Get Number of Rows in the Dataset

The jobscript needs the number of rows in the dataset, one per known/unknown pair.

In [211]:
num_rows_for_jobscript = complete_metadata.shape[0]
print(f"Number of rows needed in jobscript: {num_rows_for_jobscript}")

Number of rows needed in jobscript: 2400
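
How the jobscripts consume the lists is not shown in this notebook. One common arrangement is an array job in which each task reads the line matching its index, which is why the row count matters. The sketch below assumes that arrangement: `doc_for_task` is a hypothetical helper, and the 1-based task index is an assumption.

```python
from itertools import islice
from pathlib import Path
import tempfile


def doc_for_task(list_path, task_index: int) -> str:
    """Return the entry on line `task_index` (1-based) of a doc list file."""
    if task_index < 1:
        raise ValueError("task_index is 1-based")
    with open(list_path, encoding="utf-8") as handle:
        # islice skips to the requested line without loading the whole file
        line = next(islice(handle, task_index - 1, task_index), None)
    if line is None:
        raise IndexError(f"{list_path} has fewer than {task_index} lines")
    return line.rstrip("\n")


# Demo on a throwaway file standing in for known_doc_list.txt
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "known_doc_list.txt"
    demo.write_text("doc_a\ndoc_b\ndoc_c\n", encoding="utf-8")
    second = doc_for_task(demo, 2)
    print(second)
```

Under this assumption the array would run from 1 to the row count printed above, with each task resolving its own known and unknown document.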


## Save the Lists

In [212]:

print("Writing problem doc list")
problem_doc_list.to_csv(f"{save_loc}/problem_doc_list.txt", index=False, header=False)
print("Writing known doc list")
known_doc_list.to_csv(f"{save_loc}/known_doc_list.txt", index=False, header=False)
print("Writing unknown doc list")
unknown_doc_list.to_csv(f"{save_loc}/unknown_doc_list.txt", index=False, header=False)
print("Wrote doc lists")

Writing problem doc list
Writing known doc list
Writing unknown doc list
Wrote doc lists
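
Since the three files are meant to be read line by line, it can be worth confirming after the write that they stay aligned. A minimal sketch of such a check, using only the standard library, is given here. `check_doc_lists` is a hypothetical helper, and it assumes each file holds one unquoted entry per line, which holds for IDs like those shown earlier but would not if an ID contained a comma or quote character.

```python
from pathlib import Path
import tempfile


def read_lines(path: Path) -> list:
    """Read a list file into a list of entries, one per line."""
    return path.read_text(encoding="utf-8").splitlines()


def check_doc_lists(save_dir) -> int:
    """Check that the three list files line up; return the number of problems."""
    save_dir = Path(save_dir)
    known = read_lines(save_dir / "known_doc_list.txt")
    unknown = read_lines(save_dir / "unknown_doc_list.txt")
    problems = read_lines(save_dir / "problem_doc_list.txt")
    if not (len(known) == len(unknown) == len(problems)):
        raise ValueError("doc lists have different lengths")
    for i, (k, u, p) in enumerate(zip(known, unknown, problems), start=1):
        if p != f"{k} vs {u}":
            raise ValueError(f"line {i}: problem entry does not match known/unknown")
    return len(problems)


# Demo on throwaway files with invented IDs
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "known_doc_list.txt").write_text("k1\nk2\n", encoding="utf-8")
    (d / "unknown_doc_list.txt").write_text("u1\nu1\n", encoding="utf-8")
    (d / "problem_doc_list.txt").write_text("k1 vs u1\nk2 vs u1\n", encoding="utf-8")
    n_checked = check_doc_lists(d)
    print(n_checked)
```

Run against the real output directory, `check_doc_lists(save_loc)` would be expected to return the same count as `num_rows_for_jobscript`.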
