# Text Sizes Documentation and Validation Checks

### Description

The Text Sizes.csv file contains text sizes, in number of words, by text and section (usually a book or work). In addition, a special value in the Section field, ```__ALL__```, marks the row that holds the total number of words in the work as a whole. All values come from Accordance version 12 and the various Greek works I own in Accordance at this time.



|Field | Description |
|:----- |:----------- |
|Author | The approximate or commonly assumed author. This is not populated with any critically authoritative information. See note below.|
|Text |This is the text module name in Accordance.|
|Section|This is the book or section name.|
|Total Words|This is the total number of words in the section. Where Section == ```__ALL__``` it is the total number of words in the Text.|
|IncludedInTotal|True if this row's Total Words are included in the ```__ALL__``` value. Must be False for ```__ALL__```, and for any other sections not included.|
|AccWorkspace|The name of the Accordance workspace, stored in the repo, that was used to produce the data.|


Note: the Author field is useful only in some cases. In the investigation that prompted the creation of this data I was looking at works by Luke. In many cases, I have listed the Author as Various or Unknown. It is debatable whether the field should be present at all.
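The field rules in the table above can also be checked row by row. The following is a minimal sketch that uses only the standard library, not part of the original notebook. The function name `check_field_rules` and the sample rows are hypothetical, and it assumes that the CSV header matches the field names in the table, that IncludedInTotal is stored as the text True or False, and that each Text has exactly one ```__ALL__``` row.

```python
import csv
import io


def check_field_rules(rows):
    """Return a list of human-readable problems found in the rows.

    `rows` is any iterable of dicts keyed by the CSV header names
    (assumed to match the field table: Text, Section, Total Words, ...).
    """
    problems = []
    all_counts = {}  # number of __ALL__ rows seen per Text
    for row in rows:
        text, section = row["Text"], row["Section"]
        all_counts.setdefault(text, 0)
        if section == "__ALL__":
            all_counts[text] += 1
            # Rule from the field table: __ALL__ rows are never included
            if row["IncludedInTotal"].strip().lower() != "false":
                problems.append(f"{text}: __ALL__ row must have IncludedInTotal False")
        if not row["Total Words"].strip().isdigit():
            problems.append(f"{text} / {section}: Total Words is not a whole number")
    # Assumption: every Text carries exactly one __ALL__ row
    for text, count in all_counts.items():
        if count != 1:
            problems.append(f"{text}: expected exactly one __ALL__ row, found {count}")
    return problems


# Hypothetical sample rows, not taken from Text Sizes.csv
sample_csv = """Author,Text,Section,Total Words,IncludedInTotal,AccWorkspace
Unknown,Example Text,Part 1,10,True,example.accord
Unknown,Example Text,Part 2,5,True,example.accord
Unknown,Example Text,__ALL__,15,True,example.accord
"""
problems = check_field_rules(csv.DictReader(io.StringIO(sample_csv)))
print(problems)
```

In the sample, the ```__ALL__``` row is deliberately marked as included, so one problem is reported. To run the same check on the real file, pass `csv.DictReader(open("Text Sizes.csv", newline=""))` instead of the sample.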

In [51]:
import pandas as pd
size_data = pd.read_csv("Text Sizes.csv")

The texts in the data are these.

In [52]:
sorted(size_data['Text'].drop_duplicates())

['Apostolic Fathers Greek',
 'Athanasius Greek',
 'Basil Greek',
 'Epictetus Greek',
 'Gospel of Thomas (Greek)',
 'Gregory Greek',
 'Gregory of Nyssa (Greek)',
 'Josephus Greek',
 'LXX Rahlfs Tagged',
 'LXX Swete',
 'LXX Swete (Enoch)',
 'LXX Swete (parallel texts)',
 'NA28 GNT',
 'Philo Greek',
 'Pseudepigrapha Greek',
 'Pseudo-Clem. Homilies Greek']

### Validation Checks
These validation checks confirm that, for each work, the sum of all sections where IncludedInTotal == True matches the ```__ALL__``` total for the work.

In [54]:
# Compute totals and compare with __ALL__ values
# Compute totals using only included values
total_sizes = size_data.query('IncludedInTotal == True')[['Text','Total Words']].groupby(['Text']).sum()
total_sizes.reset_index(inplace=True)
total_sizes.columns=['Text','Aggregated Total']

# Extract the __ALL__ values
__all__sizes = size_data.query('Section == \'__ALL__\'')[['Text','Total Words']]
__all__sizes.columns=['Text', '__ALL__ Words']
compare_sizes = total_sizes.join(__all__sizes.set_index(['Text']), on=['Text'])
compare_sizes['Diff'] = compare_sizes['Aggregated Total'] != compare_sizes['__ALL__ Words']
compare_sizes

| |Text|Aggregated Total|```__ALL__``` Words|Diff|
|:-- |:----- |:----- |:----- |:----- |
|0|Apostolic Fathers Greek|66264|66264|False|
|1|Athanasius Greek|146844|146844|False|
|2|Basil Greek|25899|25899|False|
|3|Epictetus Greek|80305|80305|False|
|4|Gospel of Thomas (Greek)|2665|2665|False|
|5|Gregory Greek|9747|9747|False|
|6|Gregory of Nyssa (Greek)|20359|20359|False|
|7|Josephus Greek|468505|468505|False|
|8|LXX Rahlfs Tagged|587477|587477|False|
|9|LXX Swete|583665|583665|False|
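Reading the Diff column by eye works for a handful of texts, but the same comparison can be reduced to a result that lists only the failures. The following is a sketch, not the notebook's own code. The function name `find_mismatches` and the sample frame are hypothetical, and it assumes IncludedInTotal is read as a boolean column. It uses an outer merge so that a text with sections but no ```__ALL__``` row (or the reverse) is flagged rather than silently dropped.

```python
import pandas as pd


def find_mismatches(size_data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per Text whose included sections do not sum to its __ALL__ value."""
    # Sum only the rows flagged as included (assumes a boolean column)
    included = size_data[size_data["IncludedInTotal"] == True]  # noqa: E712
    aggregated = (
        included.groupby("Text", as_index=False)["Total Words"].sum()
        .rename(columns={"Total Words": "Aggregated Total"})
    )
    # Extract the recorded __ALL__ totals
    all_rows = (
        size_data.loc[size_data["Section"] == "__ALL__", ["Text", "Total Words"]]
        .rename(columns={"Total Words": "__ALL__ Words"})
    )
    # Outer merge keeps texts present on only one side; the missing value is NaN
    compared = aggregated.merge(all_rows, on="Text", how="outer")
    # NaN compares unequal to everything, so one-sided texts are flagged too
    mismatch = compared["Aggregated Total"] != compared["__ALL__ Words"]
    return compared[mismatch].reset_index(drop=True)


# Hypothetical sample data, not taken from Text Sizes.csv
sample = pd.DataFrame({
    "Text": ["A", "A", "A", "B", "B"],
    "Section": ["One", "Two", "__ALL__", "One", "__ALL__"],
    "Total Words": [10, 5, 15, 7, 8],
    "IncludedInTotal": [True, True, False, True, False],
})
mismatches = find_mismatches(sample)
print(mismatches)
```

In the sample, text A sums correctly and text B does not, so only B is returned. Against the real data, `assert find_mismatches(size_data).empty` would turn the check into a hard failure when any work is out of balance.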
