# Confirm Text/Document ID Coverage
All (Pcc) texts in the original pile files, as saved in `pile_tables/raw/`, should show up in either the exclusions file for that data group (i.e. as a row in the exclusions dataframe) or the final conllu directory (i.e. as a document, marked by a `newdoc id` comment, in one of the conllu files there).

This notebook works through collecting the IDs found in each of these 3 places and comparing the resulting objects to confirm nothing has been lost.
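As a minimal sketch of the comparison described above, the check reduces to set arithmetic over the three ID collections. The IDs below are made-up placeholders, not values from the real data:

```python
# Toy stand-ins for the three ID collections (hypothetical values)
raw_ids = {"pcc_val_0000001", "pcc_val_0000002",
           "pcc_val_0000003", "pcc_val_0000004"}
conllu_ids = {"pcc_val_0000001", "pcc_val_0000003"}
excl_ids = {"pcc_val_0000002"}

# every raw ID should appear in the conllu output or in the exclusions;
# whatever is left over has been lost somewhere along the way
unaccounted = raw_ids - conllu_ids - excl_ids

print(sorted(unaccounted))
```

An empty `unaccounted` set would mean full coverage for that data group.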

_One possible concern, however, is that the exclusions files may be inaccurate: some documents were erroneously marked as `fail`s and skipped during parsing, but then successfully completed in a following reparse. So really, it is the comparison of the raw dataframes and the conllu files that matters, and any IDs missing from the conllu files should be in the exclusions dataframe. After this comparison is drawn, documents/texts added to the exclusions in error (i.e. IDs that actually do have parsed sentences in a conllu file) should be removed from the exclusions dataframe._
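One way the cleanup of erroneous exclusions might look, assuming the exclusions dataframe has a `text_id` column as in the cells below. The `excl_type` column and all values here are hypothetical and only serve the illustration:

```python
import pandas as pd

# hypothetical exclusions dataframe and set of IDs found in the conllu files
excldf = pd.DataFrame({
    "text_id": ["pcc_val_0000002", "pcc_val_0000005", "pcc_val_0000007"],
    "excl_type": ["fail", "fail", "duplicate"],  # illustrative column only
})
conllu_ids = {"pcc_val_0000001", "pcc_val_0000005"}

# an excluded ID that also has parsed output was excluded in error
in_error = excldf.text_id.isin(conllu_ids)

# keep only the rows that are genuinely excluded
corrected_excldf = excldf.loc[~in_error].reset_index(drop=True)
print(corrected_excldf)
```

Keeping `in_error` as a separate boolean mask makes it easy to inspect the affected rows before dropping them.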

If there are raw text IDs not found in the conllu files or the exclusions, the final (top level `pile_tables/`) and temporary (`pile_tables/tmp/`) directories should be searched, and/or the dataframes in `slices/[data group]/tmp/` should be compared with the "final" slices in `slices/[data group]/`.

In [63]:
#!/home/arh234/.conda/envs/dev-sanpi/bin/python
# coding=utf-8

import pandas as pd
import pyconll

from pathlib import Path
# from datetime import datetime
# tstamp = datetime.fromtimestamp
DATA_DIR = Path('/share/compling/data/puddin')
DATA_GRP = 'val'
DF_NAME = f'pile_{DATA_GRP}_Pile-CC_df.pkl.gz'


## Load meta info dataframe

In [64]:
meta = pd.read_csv(DATA_DIR.joinpath('completed-puddin_meta-index.csv'))
meta.head()

Unnamed: 0,record,slice_name,total_texts,first_text_id,last_text_id,tmp_slice_path,final_slice_path,conllu_path,origin_filepath,data_origin_group,...,seconds,kept_df_mtime,excl_df_mtime,slice_df_mtime,conllu_mtime,kept_df_gzMB,excl_df_gzMB,slice_df_gzMB,conllu_MB,end_timedelta
0,29-001-0013,Pcc29_1,9999,pcc_eng_29_001.0001_x0000001,pcc_eng_29_001.9999_x0016002,pile_tables/slices/Pcc29/tmp/pile_29-001_Pile-...,pile_tables/slices/Pcc29/pile_29-001_Pile-CC_d...,Pcc29.conll/pcc_eng_29-001.conllu,/share/compling/data/pile/train/29.jsonl,29,...,3094,2022-03-27 02:12:00,2022-04-30 03:31:00,2022-04-12 21:17:00,2022-04-12 22:13:00,1015.2,0.09,9.33,376.85,0.0
1,28-001-0014,Pcc28_1,9999,pcc_eng_28_001.0001_x0000001,pcc_eng_28_001.9999_x0016230,pile_tables/slices/Pcc28/tmp/pile_28-001_Pile-...,pile_tables/slices/Pcc28/pile_28-001_Pile-CC_d...,Pcc28.conll/pcc_eng_28-001.conllu,/share/compling/data/pile/train/28.jsonl,28,...,2505,2022-04-02 06:13:00,2022-06-28 15:20:00,2022-04-12 21:27:00,2022-04-12 22:13:00,1014.1,1750.58,9.36,377.91,0.0
2,28-002-0016,Pcc28_2,9999,pcc_eng_28_002.0001_x0016232,pcc_eng_28_002.9999_x0032272,pile_tables/slices/Pcc28/tmp/pile_28-002_Pile-...,pile_tables/slices/Pcc28/pile_28-002_Pile-CC_d...,Pcc28.conll/pcc_eng_28-002.conllu,/share/compling/data/pile/train/28.jsonl,28,...,2514,2022-04-02 06:13:00,2022-06-28 15:20:00,2022-04-12 21:27:00,2022-04-12 22:55:00,1014.1,1750.58,9.29,375.33,0.0
3,29-002-0017,Pcc29_2,9999,pcc_eng_29_002.0001_x0016004,pcc_eng_29_002.9999_x0032274,pile_tables/slices/Pcc29/tmp/pile_29-002_Pile-...,pile_tables/slices/Pcc29/pile_29-002_Pile-CC_d...,Pcc29.conll/pcc_eng_29-002.conllu,/share/compling/data/pile/train/29.jsonl,29,...,3188,2022-03-27 02:12:00,2022-04-30 03:31:00,2022-04-12 21:17:00,2022-04-12 23:06:00,1015.2,0.09,9.64,393.87,0.0
4,28-003-0019,Pcc28_3,9999,pcc_eng_28_003.0001_x0032273,pcc_eng_28_003.9999_x0048538,pile_tables/slices/Pcc28/tmp/pile_28-003_Pile-...,pile_tables/slices/Pcc28/pile_28-003_Pile-CC_d...,Pcc28.conll/pcc_eng_28-003.conllu,/share/compling/data/pile/train/28.jsonl,28,...,2512,2022-04-02 06:13:00,2022-06-28 15:20:00,2022-04-12 21:27:00,2022-04-12 23:37:00,1014.1,1750.58,9.42,381.21,0.0


For each row (i.e. slice), compare the text IDs found in the files at the following paths: raw, final, conllu. Make sure any IDs missing from the conllu files are in the exclusions dataframe.

_Should probably group by exclusions path/data group/original data source column_

In [65]:
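def newdocs_iter(conll_dir):
    """Yield the reconstructed raw text ID of every document in `conll_dir`."""
    for f in conll_dir.glob('*.conllu'):
        for s in pyconll.iter_from_file(str(f)):
            # sentences without a `newdoc id` comment do not start a document
            try:
                doc_id = s.meta_value('newdoc id')
            except KeyError:
                continue
            # keep fields 0, 2 and 4 of the ID and strip any 'x' characters
            recons_raw_id = '_'.join([x.replace('x', '')
                                     for x in doc_id.split('_')[0:5:2]])
            yield recons_raw_id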
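To make the ID reconstruction concrete, here is the same string manipulation pulled out into a standalone helper (`reconstruct_raw_id` is a hypothetical name, not part of the notebook). It assumes the `newdoc id` values have the same shape as the `first_text_id` entries in the meta table:

```python
def reconstruct_raw_id(doc_id: str) -> str:
    """Rebuild a raw text ID from a conllu document ID."""
    # take every other underscore-separated field among the first five: 0, 2, 4
    parts = doc_id.split('_')[0:5:2]
    # remove every 'x' character from each kept field, then rejoin
    return '_'.join(p.replace('x', '') for p in parts)


example = reconstruct_raw_id('pcc_eng_29_001.0001_x0000001')
print(example)  # pcc_29_0000001
```

The language tag (`eng`) and the slice-internal position (`001.0001`) are dropped, leaving the corpus prefix, the data group, and the running text number.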

In [71]:
findfs = []
rawdfs = []
problems = []
for grp, mdf in meta.groupby('exclusions_path'):
    excl_path = DATA_DIR.joinpath(grp)
    # print(excl_path)
    #sanity checks
    if len(mdf.final_df_path.unique()) != 1: 
        print('WARNING! different paths showing for final df')
    if len(mdf.origin_filepath.unique()) != 1: 
        print('WARNING! different paths showing for source file')
    findf_path = DATA_DIR.joinpath(mdf.final_df_path.iloc[0])
    # print(findf_path)
    rawdf_path = DATA_DIR.joinpath(f'pile_tables/raw/{findf_path.name}')
    # print(rawdf_path)
    conll_dir = DATA_DIR.joinpath(Path(mdf.conllu_path.iloc[0]).parent)
    # print(conll_dir)
    # print('...........\n')

    #* now check text ids for each
    rawdf = pd.read_pickle(rawdf_path)
    findf = pd.read_pickle(findf_path)
    excldf = pd.read_pickle(excl_path)
    conllu_ids = set(newdocs_iter(conll_dir))

    findf = findf.assign(late_excl = ~findf.text_id.isin(conllu_ids))
    rawdf = rawdf.assign(init_excl = ~rawdf.text_id.isin(findf.text_id))
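    # set union; `+` on two Series would concatenate strings index-wise
    all_excl_ids = set(findf.text_id.loc[findf.late_excl]) | set(rawdf.text_id.loc[rawdf.init_excl])
    # compare against the column values (`in` on a Series tests its index)
    unaccounted = sorted(all_excl_ids - set(excldf.text_id))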
    print(unaccounted)
    problems.append(unaccounted)

##? turn this into a dataframe with raw text id as index and columns as booleans for e.g. 'in final df', 'in exclusions', 'in conllu parse', etc?
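A possible shape for that table, sketched with placeholder IDs. The column names (`in_final_df`, `in_conllu`, `in_exclusions`, `accounted_for`) are suggestions rather than existing names in the project:

```python
import pandas as pd

# placeholder ID collections; in the notebook these would come from
# rawdf.text_id, findf.text_id, newdocs_iter(...) and excldf.text_id
raw_ids = ["a1", "a2", "a3", "a4"]
final_ids = {"a1", "a2", "a3"}
conllu_ids = {"a1", "a2"}
excl_ids = {"a3"}

# one row per raw text ID, one boolean column per location
coverage = pd.DataFrame(index=pd.Index(raw_ids, name="text_id"))
coverage["in_final_df"] = coverage.index.isin(final_ids)
coverage["in_conllu"] = coverage.index.isin(conllu_ids)
coverage["in_exclusions"] = coverage.index.isin(excl_ids)

# a text is accounted for if it was parsed or explicitly excluded
coverage["accounted_for"] = coverage.in_conllu | coverage.in_exclusions
unaccounted = coverage.index[~coverage.accounted_for].tolist()
print(coverage)
```

With this layout, rows where both `in_conllu` and `in_exclusions` are true would also flag the exclusions that were recorded in error.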




## 1. Get the original text IDs from the raw dataframes

In [None]:
raw_ex = pd.read_pickle(DATA_DIR.joinpath(
    f'pile_tables/raw/{DF_NAME}'))
# raw_ex


## 2. Get the text IDs from the final dataframe

In [None]:
fin_ex = pd.read_pickle(DATA_DIR.joinpath(f'pile_tables/{DF_NAME}'))
fin_ex


In [None]:
orig_excl = raw_ex.loc[~ raw_ex.text_id.isin(fin_ex.text_id),'text_id']

## 3. Get the document (text) IDs from the finalized conllu files

In [None]:
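# assumption: `doc_ids` (never defined above) holds the IDs from this group's conllu directory
doc_ids = set(newdocs_iter(conll_dir))
not_in_conllu = [i for i in fin_ex.text_id if i not in doc_ids]
not_in_conllu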

In [None]:
fin_ex.text_id