# Aggregated and Profile Raw Table Creation

This notebook takes the corpus metadata and the known and unknown document tables and builds the aggregated and profile tables. The aggregated table is prepared for an aggregated AV method, so it has one row for each known document in each problem. The profile table has a single row per problem, with all of that problem's known documents concatenated together.

In [91]:
# 1) Set the location
%cd ../../code/

[Errno 2] No such file or directory: '../../code/'
/Users/user/Documents/GitHub/paraphrase_py/code


In [92]:
from read_and_write_docs import read_jsonl, read_rds, write_jsonl

In [93]:
data_type = "training"

metadata_loc = f"/Volumes/BCross/datasets/author_verification/{data_type}/metadata.rds"
known_loc = f"/Volumes/BCross/datasets/author_verification/{data_type}/known_raw_dataframe.rds"
unknown_loc = f"/Volumes/BCross/datasets/author_verification/{data_type}/unknown_raw_dataframe.rds"
base_save_loc = f"/Volumes/BCross/datasets/author_verification/{data_type}"

In [94]:
known = read_rds(known_loc)
unknown = read_rds(unknown_loc)
metadata = read_rds(metadata_loc)

In [95]:
# Join the known data: adds one row per known document in each problem
merged = (
    metadata
        .merge(
            known,
            how="left",                  # keep every row from metadata
            left_on="known_author",      # join key in metadata
            right_on="author",           # join key in known
            suffixes=("", "_known")      # suffix is applied only if column names clash
        )
        .drop(columns="known_author")    # drop the metadata join key; `author` from known is kept
)

# Join the unknown data: attaches the single unknown document to each existing row
merged = (
    merged
        .merge(
            unknown,
            how="left",                  # keep every row from the first merge
            left_on="unknown_author",    # join key carried over from metadata
            right_on="author",           # join key in unknown
            suffixes=("", "_unknown")    # clashing unknown columns get the "_unknown" suffix
        )
        .drop(columns="unknown_author")  # drop the metadata join key; `author_unknown` is kept
)

# The known columns are still unsuffixed: rename them, then select the final columns
merged.rename(columns={
    "doc_id": "doc_id_known",
    "author": "author_known",
    "text": "text_known"
}, inplace=True)

merged = merged[["problem", "corpus", "doc_id_known", "doc_id_unknown",
                 "author_known", "author_unknown", "text_known", "text_unknown"]]

## Aggregated Table

The aggregated table is just the merged table. It has a single row per known document, which will be compared with the unknown document.
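
The one-row-per-known-document layout comes from the left join fanning out each problem. The sketch below illustrates this on toy data; the frames and values are invented for illustration, and only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
metadata = pd.DataFrame({
    "problem": ["p1", "p2"],
    "corpus": ["Enron", "Enron"],
    "known_author": ["alice", "bob"],
    "unknown_author": ["alice", "carol"],
})
known = pd.DataFrame({
    "doc_id": ["k1", "k2", "k3"],
    "author": ["alice", "alice", "bob"],
    "text": ["first known text", "second known text", "third known text"],
})

# Left join: each metadata row is repeated once per matching known document.
fanned_out = metadata.merge(
    known, how="left", left_on="known_author", right_on="author"
)

# Number of aggregated rows per problem = number of known documents in it.
rows_per_problem = fanned_out.groupby("problem").size()
print(rows_per_problem.to_dict())
```

Here problem `p1` has two known documents and so contributes two rows, while `p2` contributes one.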

We then save the aggregated tables in their respective areas.
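
As a rough sketch of the per-corpus save, the example below writes one JSON Lines file per corpus and reads one back. It assumes pandas' own `to_json(..., lines=True)` as a stand-in for `write_jsonl`, whose behaviour is not shown in this notebook, and uses a temporary directory as a hypothetical replacement for `base_save_loc`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy aggregated table; the values are invented for illustration.
aggregated = pd.DataFrame({
    "problem": ["p1", "p1", "p2"],
    "corpus": ["Enron", "Enron", "Wiki"],
    "text_known": ["a", "b", "c"],
})

# Hypothetical base directory standing in for base_save_loc.
base_dir = Path(tempfile.mkdtemp())

written = {}
for corpus, frame in aggregated.groupby("corpus"):
    out_dir = base_dir / corpus
    out_dir.mkdir(parents=True, exist_ok=True)  # create the corpus folder if missing
    out_path = out_dir / "aggregated_raw.jsonl"
    # Stand-in for write_jsonl: one JSON object per line.
    frame.to_json(out_path, orient="records", lines=True)
    written[corpus] = out_path

# Read one file back to confirm the row count survived the round trip.
enron = pd.read_json(written["Enron"], lines=True)
print(len(enron))
```

Iterating over `groupby("corpus")` gives the same per-corpus subsets as filtering inside a loop, in a deterministic order.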

In [96]:
# The merged table is already the aggregated version; the profile version is built later
aggregated = merged.copy()

In [97]:
aggregated.groupby('corpus').size()

corpus
ACL                   186
All-the-news         1776
Amazon               6400
Enron                 224
IMDB                  400
Koppel's Blogs       3600
Perverted Justice     380
Reddit               2400
StackExchange         150
The Apricity          900
The Telegraph         440
TripAdvisor           120
Wiki                  450
Yelp                 1600
dtype: int64

In [99]:
corpus_list = list(set(aggregated['corpus']))

for corpus in corpus_list:
    print(f"Saving {corpus} aggregated dataframe")
    aggregated_filtered = aggregated[aggregated['corpus'] == corpus]
    
    save_loc = f"{base_save_loc}/{corpus}/aggregated_raw.jsonl"
    
    write_jsonl(aggregated_filtered, save_loc)

Saving ACL aggregated dataframe
Saving Perverted Justice aggregated dataframe
Saving Reddit aggregated dataframe
Saving Yelp aggregated dataframe
Saving TripAdvisor aggregated dataframe
Saving All-the-news aggregated dataframe
Saving Koppel's Blogs aggregated dataframe
Saving StackExchange aggregated dataframe
Saving IMDB aggregated dataframe
Saving Amazon aggregated dataframe
Saving The Apricity aggregated dataframe
Saving Wiki aggregated dataframe
Saving The Telegraph aggregated dataframe
Saving Enron aggregated dataframe


## Profile Table

The profile table concatenates all of the known documents in each problem into a single string, with a newline separating the texts. Each problem then compares the full set of known documents against its single unknown document.

We then save the profile tables in the respective corpus areas.
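
The concatenation step can be illustrated on toy data. The sketch below uses invented values and a reduced set of columns, with the same grouping idea as this notebook: group on every column except the two being aggregated, join the known texts with newlines, and collect the known document IDs into a list.

```python
import pandas as pd

# Toy aggregated rows; values are invented for illustration.
merged = pd.DataFrame({
    "problem": ["p1", "p1", "p2"],
    "doc_id_known": ["k1", "k2", "k3"],
    "doc_id_unknown": ["u1", "u1", "u2"],
    "text_known": ["first text", "second text", "third text"],
    "text_unknown": ["query one", "query one", "query two"],
})

# Grouping keys: every column except the two being aggregated.
group_cols = [c for c in merged.columns if c not in ("doc_id_known", "text_known")]

profile = (
    merged
    .groupby(group_cols, as_index=False)
    .agg({
        "text_known": lambda x: "\n".join(x.dropna()),  # one newline-separated string
        "doc_id_known": list,                           # list of contributing doc IDs
    })
)

print(profile.loc[0, "text_known"].split("\n"))
print(profile.loc[0, "doc_id_known"])
```

The two `p1` rows collapse into one profile row, so the profile table has one row per distinct combination of the grouping columns.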

In [100]:
# Grouping keys: every column except the two being aggregated
group_cols = [c for c in merged.columns if c not in ("doc_id_known", "text_known")]

profile = (
    merged
        .groupby(group_cols, as_index=False)
        .agg({
            "text_known": lambda x: "\n".join(x.dropna()),   # concat with newline
            "doc_id_known": list                            # collect into Python list
        })
)

profile = profile[["problem", "corpus", "doc_id_known", "doc_id_unknown",
                   "author_known", "author_unknown", "text_known", "text_unknown"]]

In [101]:
profile.groupby('corpus').size()

corpus
ACL                   186
All-the-news         1771
Amazon               1600
Enron                  64
IMDB                  400
Koppel's Blogs       1200
Perverted Justice     208
Reddit                800
StackExchange         150
The Apricity          228
The Telegraph         220
TripAdvisor           120
Wiki                  150
Yelp                  320
dtype: int64

In [102]:
corpus_list = list(set(profile['corpus']))

for corpus in corpus_list:
    print(f"Saving {corpus} profile dataframe")
    profile_filtered = profile[profile['corpus'] == corpus]
    
    save_loc = f"{base_save_loc}/{corpus}/profile_raw.jsonl"
    
    write_jsonl(profile_filtered, save_loc)

Saving Perverted Justice profile dataframe
Saving Reddit profile dataframe
Saving ACL profile dataframe
Saving Yelp profile dataframe
Saving TripAdvisor profile dataframe
Saving All-the-news profile dataframe
Saving Koppel's Blogs profile dataframe
Saving StackExchange profile dataframe
Saving IMDB profile dataframe
Saving The Apricity profile dataframe
Saving Wiki profile dataframe
Saving Amazon profile dataframe
Saving The Telegraph profile dataframe
Saving Enron profile dataframe
