# "Predict Translation Word And Char Count (Part 1)"

> Predicting the word or character count of a translation, for use as a quality or validation check
- toc: true
- branch: master
- badges: false
- comments: true
- hide: false
- search_exclude: true
- image: images/PredictTranslationWordAndCharCount_1.png
- categories: [Deep Learning, Regression, Python, fastai]
- show_tags: true

In [None]:
#hide
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'TODANALYTICS/'
# base_dir = ""

Mounted at /content/gdrive


## Purpose
Machine translation is increasingly used to lighten the load of human translators. The critical component is the *translation engine*, a model that takes a sequence of source words and outputs a sequence of translated words. Training such a model requires many thousands of aligned sentence pairs as training examples.

There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

* For training: Validate the alignment of two sentences (in a training example) by comparing their *word size* and/or *character size*
* For inference: Validate the *word size* and/or *character size* of a translated/proofread sentence

The purpose of this project is to discover such a model for a variety of languages and to evaluate its use in the above roles.
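
As a rough illustration of the validation idea, the sketch below flags a sentence pair whose word-count ratio falls outside a fixed band. It is only a stand-in for the learned model this project aims for: the function names `count_words` and `length_ratio_ok` and the bounds `low` and `high` are assumptions made for this example, not values derived from the dataset.

```python
import re


def count_words(text: str) -> int:
    """Count runs of word characters (the same rule this notebook uses later)."""
    return len(re.findall(r"\w+", text))


def length_ratio_ok(source: str, target: str,
                    low: float = 0.5, high: float = 2.0) -> bool:
    """Return True when the target/source word ratio lies inside [low, high].

    The bounds are hypothetical placeholders, not fitted values.
    """
    src_words = count_words(source)
    if src_words == 0:
        # An empty source can only be matched by an empty target.
        return count_words(target) == 0
    ratio = count_words(target) / src_words
    return low <= ratio <= high


english = "Let us bow our heads."
afrikaans = "Laat ons ons hoofde buig."

print(length_ratio_ok(english, afrikaans))  # plausible alignment
print(length_ratio_ok(english, "Amen."))    # suspiciously short target
```

A fitted regression model would replace the fixed band with a predicted count and a tolerance per language.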

## Dataset and Variables
The dataset comes in the form of *contributions*, each captured in one of 167289 rows (data points). Each contribution is a sentence, either in the source language (always English) or a translation of a source sentence. A translated sentence can have many versions, including the one initially provided by the translation engine. Human proofreaders then provide their own corrections as further versions.

There are 4 kinds of contributions:

* E: English contributions
* T: Translate contributions - provided by the translation engine
* C: Create contributions - corrections provided by human proofreaders/translators
* V: Vote contributions - whenever a human proofreader/translator indicates agreement with a contribution provided by the translation engine, it is recorded in the form of a vote contribution

The features of the dataset are:

* m_descriptor: Unique identifier of a document
* t_lan: Language of the translation (English is also considered a translation)
* t_senc: Number of sentences in a document
* t_version: Version of a translation
* s_typ: Type of the sentence
* s_rsen: Number of a sentence within a document
* e_id: Database primary key of a contribution's content
* e_top: Content of the contribution that got the most votes
* be_id: N/A
* be_top: N/A
* c_id: Database primary key of a contribution
* c_created_at: Creation time of a contribution
* c_kind: Kind of a contribution
* c_eis: N/A
* c_base: N/A
* a_role: N/A
* u_name: N/A
* e_content: Text content of a contribution
* chars: Number of characters in a contribution
* words: Number of words in a contribution

In this notebook we will only prepare the dataset. Exploratory data analysis and modeling will follow in later notebooks.

# Set up the Environment

In [None]:
# 
# # ! pip install fastai
# ! pip install fastai2
# ! pip install nbdev

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
# 
# ! pip list | grep fastai
! pip list | grep fastai2

In [None]:
from fastai.tabular.all import *
from fastbook import *

# from fastai.tabular.all import *
# # from fastai2.tabular.all import *

In [None]:
import seaborn as sns
%matplotlib inline

In [None]:
!python --version

Python 3.6.9


In [None]:
# 
# PATH='./'
PATH = Path(base_dir)  # root folder that holds the contributions/ directory

# Get train/valid data

Next we will ingest all the data we need. Note that the content (e_content) for each contribution is not displayed as it often makes the presentation unwieldy.

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

## Ingest all E-contributions

In [None]:
#hide
# df_E = pd.read_csv(f'{PATH}/contributions/1955-0807y_Pride_ENG_15-0401-b_E-contributions.csv', sep='~')
# df_E = pd.read_csv(f'{PATH}/contributions/1965-0829_SatansEden_ENG_15-1104-b_E-contributions.csv', sep='~')
# df_E

In [None]:
all_files = glob.glob(f"{PATH}/contributions/*E-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_E = pd.concat(li, axis=0, ignore_index=True)
df_E.iloc[:5,:-2]

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role
0,1965-0418x,ENG,1870,18-0101-E1R,n,1,174684,Z,,,224461,2018-03-29 23:10:24.573038,E,0,,EP
1,1965-0418x,ENG,1870,18-0101-E1R,n,2,174685,Z,,,224462,2018-03-29 23:10:24.595501,E,0,,EP
2,1965-0418x,ENG,1870,18-0101-E1R,n,3,174686,Z,,,224463,2018-03-29 23:10:24.628362,E,0,,EP
3,1965-0418x,ENG,1870,18-0101-E1R,n,4,174687,Z,,,224464,2018-03-29 23:10:24.650119,E,0,,EP
4,1965-0418x,ENG,1870,18-0101-E1R,n,5,174688,Z,,,224465,2018-03-29 23:10:24.670806,E,0,,EP
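
The same glob-read-concatenate pattern is used for each kind of contribution, so it could be wrapped in a small helper. The sketch below is one possible shape for it: the function name `load_contributions` and the two demo files are made up for illustration, and the `sorted` call is an addition that makes the row order reproducible.

```python
import glob
import tempfile
from pathlib import Path

import pandas as pd


def load_contributions(base_dir, kind):
    """Concatenate every '*{kind}-contributions.csv' under base_dir/contributions."""
    pattern = str(Path(base_dir) / "contributions" / f"*{kind}-contributions.csv")
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        raise FileNotFoundError(f"no files match {pattern}")
    frames = [pd.read_csv(f, sep="~", header=0, index_col=None) for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)


# Tiny demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    contrib_dir = Path(tmp_dir) / "contributions"
    contrib_dir.mkdir()
    (contrib_dir / "docA_E-contributions.csv").write_text("c_id~e_content\n1~Hello.\n")
    (contrib_dir / "docB_E-contributions.csv").write_text("c_id~e_content\n2~Amen.\n")
    demo = load_contributions(tmp_dir, "E")

print(demo)
```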


In [None]:
#hide
df_E

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role,u_name,e_content
0,1965-0418x,ENG,1870,18-0101-E1R,n,1,174684,Z,,,224461,2018-03-29 23:10:24.573038,E,0,,EP,kobest,Let us bow our heads.
1,1965-0418x,ENG,1870,18-0101-E1R,n,2,174685,Z,,,224462,2018-03-29 23:10:24.595501,E,0,,EP,kobest,"Lord, as we gather here this fine Easter morning, see the little buds pressing their way out, the bees flying in and getting their portion, the birds singing like their hearts would burst with joy, because there is an Easter."
2,1965-0418x,ENG,1870,18-0101-E1R,n,3,174686,Z,,,224463,2018-03-29 23:10:24.628362,E,0,,EP,kobest,"We believe that You raised up Jesus from the dead, many years ago, today, and we celebrate this memorial day."
3,1965-0418x,ENG,1870,18-0101-E1R,n,4,174687,Z,,,224464,2018-03-29 23:10:24.650119,E,0,,EP,kobest,"And let there come an Easter among us all, today."
4,1965-0418x,ENG,1870,18-0101-E1R,n,5,174688,Z,,,224465,2018-03-29 23:10:24.670806,E,0,,EP,kobest,"May we, as His servants, understand His Word, that we were in His fellowship then, and that now that we are risen with Him and setting together in Heavenly places."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167284,CAB-06,ENG,835,18-1101-b,n,831,493574,Z,,,707747,2019-08-31 17:09:28.899814,E,0,,EP,kobest,What else could we desire above Jesus Himself?
167285,CAB-06,ENG,835,18-1101-b,n,832,493575,Z,,,707748,2019-08-31 17:09:28.916381,E,0,,EP,kobest,"Is He not everything, even Perfect Everything?"
167286,CAB-06,ENG,835,18-1101-b,n,833,493576,Z,,,707749,2019-08-31 17:09:28.933088,E,0,,EP,kobest,He that hath an ear let him hear what the Spirit saith to the churches.
167287,CAB-06,ENG,835,18-1101-b,n,834,493577,Z,,,707750,2019-08-31 17:09:28.949515,E,0,,EP,kobest,Amen.


In [None]:
df_E = df_E.drop(['e_id','t_senc','s_typ','e_top','be_id','be_top','c_created_at','c_kind','c_eis','c_base','a_role','u_name'], axis=1) #each record now unique
df_E.iloc[:5,:-1]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id
0,1965-0418x,ENG,18-0101-E1R,1,224461
1,1965-0418x,ENG,18-0101-E1R,2,224462
2,1965-0418x,ENG,18-0101-E1R,3,224463
3,1965-0418x,ENG,18-0101-E1R,4,224464
4,1965-0418x,ENG,18-0101-E1R,5,224465


In [None]:
#hide
df_E

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content
0,1965-0418x,ENG,18-0101,1,224461,Let us bow our heads.
1,1965-0418x,ENG,18-0101,2,224462,"Lord, as we gather here this fine Easter morning, see the little buds pressing their way out, the bees flying in and getting their portion, the birds singing like their hearts would burst with joy, because there is an Easter."
2,1965-0418x,ENG,18-0101,3,224463,"We believe that You raised up Jesus from the dead, many years ago, today, and we celebrate this memorial day."
3,1965-0418x,ENG,18-0101,4,224464,"And let there come an Easter among us all, today."
4,1965-0418x,ENG,18-0101,5,224465,"May we, as His servants, understand His Word, that we were in His fellowship then, and that now that we are risen with Him and setting together in Heavenly places."
...,...,...,...,...,...,...
167284,CAB-06,ENG,18-1101,831,707747,What else could we desire above Jesus Himself?
167285,CAB-06,ENG,18-1101,832,707748,"Is He not everything, even Perfect Everything?"
167286,CAB-06,ENG,18-1101,833,707749,He that hath an ear let him hear what the Spirit saith to the churches.
167287,CAB-06,ENG,18-1101,834,707750,Amen.


In [None]:
# 
#handle NaNs in e_content
e_content_nans = df_E['e_content'].isna()
df_E[e_content_nans]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content
33415,1956-0805,ENG,15-0402,1176,454335,
43018,1957-0419,ENG,15-0401,505,13306,


In [None]:
# 
#replace e_content NaNs with empty strings
df_E.loc[e_content_nans, 'e_content'] = ''
# df_E.loc[e_content_nans, ['e_content']]
# OR
df_E[df_E['e_content']=='']

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content
33415,1956-0805,ENG,15-0402,1176,454335,
43018,1957-0419,ENG,15-0401,505,13306,
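
An equivalent way to replace the missing values is `Series.fillna`. The sketch below uses a small made-up frame rather than `df_E`, so the column values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df_E with one missing sentence.
demo = pd.DataFrame({
    "c_id": [1, 2, 3],
    "e_content": ["Amen.", None, "Let us bow our heads."],
})

# Replace missing content with an empty string so len() works on every row.
demo["e_content"] = demo["e_content"].fillna("")

empty_rows = demo[demo["e_content"] == ""]
print(empty_rows)
```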


In [None]:
# 
#add chars column
df_E['chars'] = [len(e) for e in df_E['e_content']]
# df_E['chars'] = [len(e) if type(e)==str else 1 for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,chars
0,1965-0418x,ENG,18-0101,1,224461,21
1,1965-0418x,ENG,18-0101,2,224462,225
2,1965-0418x,ENG,18-0101,3,224463,109
3,1965-0418x,ENG,18-0101,4,224464,49
4,1965-0418x,ENG,18-0101,5,224465,163
5,1965-0418x,ENG,18-0101,6,224466,96
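
The list comprehension above can also be written with the vectorized `Series.str.len` accessor. The sketch below applies it to a made-up frame; note that Python counts Unicode code points, so an accented letter such as `ë` counts as one character when it is stored in precomposed form.

```python
import pandas as pd

# Made-up sample; "vo\u00ebls" is "voëls" with a precomposed e-diaeresis.
demo = pd.DataFrame({"e_content": ["Let us bow our heads.", "", "vo\u00ebls"]})

# Character count per sentence (code points, not bytes).
demo["chars"] = demo["e_content"].str.len()
print(demo)
```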


In [None]:
#hide
df_E

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content,chars
0,1965-0418x,ENG,18-0101,1,224461,Let us bow our heads.,21
1,1965-0418x,ENG,18-0101,2,224462,"Lord, as we gather here this fine Easter morning, see the little buds pressing their way out, the bees flying in and getting their portion, the birds singing like their hearts would burst with joy, because there is an Easter.",225
2,1965-0418x,ENG,18-0101,3,224463,"We believe that You raised up Jesus from the dead, many years ago, today, and we celebrate this memorial day.",109
3,1965-0418x,ENG,18-0101,4,224464,"And let there come an Easter among us all, today.",49
4,1965-0418x,ENG,18-0101,5,224465,"May we, as His servants, understand His Word, that we were in His fellowship then, and that now that we are risen with Him and setting together in Heavenly places.",163
...,...,...,...,...,...,...,...
167284,CAB-06,ENG,18-1101,831,707747,What else could we desire above Jesus Himself?,46
167285,CAB-06,ENG,18-1101,832,707748,"Is He not everything, even Perfect Everything?",46
167286,CAB-06,ENG,18-1101,833,707749,He that hath an ear let him hear what the Spirit saith to the churches.,71
167287,CAB-06,ENG,18-1101,834,707750,Amen.,5


In [None]:
# 
# df_E.loc[e_content_nans, ['e_content','chars']]
# OR
df_E[df_E['chars']==0]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content,chars
33415,1956-0805,ENG,15-0402,1176,454335,,0
43018,1957-0419,ENG,15-0401,505,13306,,0


In [None]:
# 
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_E['words'] = [len(re.findall(r'\w+', e)) for e in df_E['e_content']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,chars,words
0,1965-0418x,ENG,18-0101,1,224461,21,5
1,1965-0418x,ENG,18-0101,2,224462,225,40
2,1965-0418x,ENG,18-0101,3,224463,109,20
3,1965-0418x,ENG,18-0101,4,224464,49,10
4,1965-0418x,ENG,18-0101,5,224465,163,30
5,1965-0418x,ENG,18-0101,6,224466,96,17
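
The pattern `\w+` counts every run of letters, digits, or underscores as one word, which has a few consequences worth knowing before modeling. The sketch below shows them on sample strings; the helper name `count_words` is made up for this example.

```python
import re


def count_words(text: str) -> int:
    """Count runs of word characters, as in the cell above."""
    return len(re.findall(r"\w+", text))


samples = [
    "Let us bow our heads.",       # plain sentence
    "Amen.",                       # punctuation is ignored
    "",                            # empty content gives zero words
    "don't",                       # the apostrophe splits this into two runs
    "omdat daar 'n Paasfees is.",  # the Afrikaans article 'n counts as one word
]

for text in samples:
    print(repr(text), count_words(text))
```

Apostrophes and hyphens split a token, so contractions and hyphenated compounds count as two words under this rule.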


In [None]:
#hide
df_E

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content,chars,words
0,1965-0418x,ENG,18-0101,1,224461,Let us bow our heads.,21,5
1,1965-0418x,ENG,18-0101,2,224462,"Lord, as we gather here this fine Easter morning, see the little buds pressing their way out, the bees flying in and getting their portion, the birds singing like their hearts would burst with joy, because there is an Easter.",225,40
2,1965-0418x,ENG,18-0101,3,224463,"We believe that You raised up Jesus from the dead, many years ago, today, and we celebrate this memorial day.",109,20
3,1965-0418x,ENG,18-0101,4,224464,"And let there come an Easter among us all, today.",49,10
4,1965-0418x,ENG,18-0101,5,224465,"May we, as His servants, understand His Word, that we were in His fellowship then, and that now that we are risen with Him and setting together in Heavenly places.",163,30
...,...,...,...,...,...,...,...,...
167284,CAB-06,ENG,18-1101,831,707747,What else could we desire above Jesus Himself?,46,8
167285,CAB-06,ENG,18-1101,832,707748,"Is He not everything, even Perfect Everything?",46,7
167286,CAB-06,ENG,18-1101,833,707749,He that hath an ear let him hear what the Spirit saith to the churches.,71,15
167287,CAB-06,ENG,18-1101,834,707750,Amen.,5,1


In [None]:
# 
#remove BER part of version from t_version so that we can use this column to join the English contributions with their matching translated contributions
df_E['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_E['t_version']]
df_E.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','c_id','chars','words']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,chars,words
0,1965-0418x,ENG,18-0101,1,224461,21,5
1,1965-0418x,ENG,18-0101,2,224462,225,40
2,1965-0418x,ENG,18-0101,3,224463,109,20
3,1965-0418x,ENG,18-0101,4,224464,49,10
4,1965-0418x,ENG,18-0101,5,224465,163,30
5,1965-0418x,ENG,18-0101,6,224466,96,17
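
To make the version trimming concrete, the sketch below applies the same split-and-join to a few version strings seen in this notebook, once as a plain function and once with pandas string methods. The function name `strip_version_suffix` is made up for illustration.

```python
import pandas as pd


def strip_version_suffix(version: str) -> str:
    """Keep only the first two dash-separated parts of a version string."""
    return "-".join(version.split("-")[:2])


versions = pd.Series(["18-0101-E1R", "18-1101-b", "15-0902-B123", "18-0101"])

# Plain-Python version, equivalent to the list comprehension above.
stripped_py = [strip_version_suffix(v) for v in versions]

# Vectorized alternative using the pandas .str accessor.
stripped_pd = versions.str.split("-").str[:2].str.join("-")

print(stripped_py)
print(stripped_pd.tolist())
```

Because the operation leaves an already trimmed version unchanged, rerunning the cell is harmless.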


In [None]:
#hide
df_E

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,c_id,e_content,chars,words
0,1965-0418x,ENG,18-0101,1,224461,Let us bow our heads.,21,5
1,1965-0418x,ENG,18-0101,2,224462,"Lord, as we gather here this fine Easter morning, see the little buds pressing their way out, the bees flying in and getting their portion, the birds singing like their hearts would burst with joy, because there is an Easter.",225,40
2,1965-0418x,ENG,18-0101,3,224463,"We believe that You raised up Jesus from the dead, many years ago, today, and we celebrate this memorial day.",109,20
3,1965-0418x,ENG,18-0101,4,224464,"And let there come an Easter among us all, today.",49,10
4,1965-0418x,ENG,18-0101,5,224465,"May we, as His servants, understand His Word, that we were in His fellowship then, and that now that we are risen with Him and setting together in Heavenly places.",163,30
...,...,...,...,...,...,...,...,...
167284,CAB-06,ENG,18-1101,831,707747,What else could we desire above Jesus Himself?,46,8
167285,CAB-06,ENG,18-1101,832,707748,"Is He not everything, even Perfect Everything?",46,7
167286,CAB-06,ENG,18-1101,833,707749,He that hath an ear let him hear what the Spirit saith to the churches.,71,15
167287,CAB-06,ENG,18-1101,834,707750,Amen.,5,1


## Ingest all V-contributions

In [None]:
#hide
# df_V = pd.read_csv(f'{PATH}/contributions/1955-0807y_Pride_CHN_15-0401-h_V-contributions.csv', sep='~')
# df_V = pd.read_csv(f'{PATH}/contributions/1965-0829_SatansEden_FIJ_15-1104-B123_V-contributions.csv', sep='~')
# df_V

In [None]:
all_files = glob.glob(f"{PATH}/contributions/*V-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_V = pd.concat(li, axis=0, ignore_index=True)
df_V.iloc[:5,:-2]

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role
0,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,181444.0,M,844713,2020-01-15 02:13:34.847562,V,11,a,CE
1,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,181444.0,M,256723,2018-04-23 11:04:31.787641,V,28,a,TE
2,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,339948,T,200635.0,N,468379,2019-01-30 22:21:29.62162,V,0,c,CE
3,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,200635,N,181445.0,N,256725,2018-04-23 11:23:43.781013,V,0,c,TE
4,1965-0418x,AFR,1870,18-0101-B123E1R,n,3,200637,M,200636.0,N,256727,2018-04-23 11:26:37.965897,V,0,c,TE


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role,u_name,e_content
0,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,181444.0,M,844713,2020-01-15 02:13:34.847562,V,11,a,CE,engest,Laat ons ons hoofde buig.
1,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,181444.0,M,256723,2018-04-23 11:04:31.787641,V,28,a,TE,linoli,Laat ons ons hoofde buig.
2,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,339948,T,200635.0,N,468379,2019-01-30 22:21:29.62162,V,0,c,CE,engest,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Paasfees is."
3,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,200635,N,181445.0,N,256725,2018-04-23 11:23:43.781013,V,0,c,TE,linoli,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Pase is."
4,1965-0418x,AFR,1870,18-0101-B123E1R,n,3,200637,M,200636.0,N,256727,2018-04-23 11:26:37.965897,V,0,c,TE,linoli,"Ons glo dat U Jesus opgewek het uit die dode, baie jare gelede, en vandag vier ons hierdie aandenking."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382737,CAB-06,AFR,835,18-1101-B123,n,833,610356,M,610356.0,M,1061632,2020-05-25 02:44:38.590448,V,12,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
382738,CAB-06,AFR,835,18-1101-B123,n,834,610357,M,610357.0,M,1061633,2020-05-25 02:44:42.21845,V,3,t,CE,engest,Amen.
382739,CAB-06,AFR,835,18-1101-B123,n,834,610357,M,494412.0,N,874368,2020-02-02 00:24:14.79957,V,0,c,TE,tilvan,Amen.
382740,CAB-06,AFR,835,18-1101-B123,n,835,610358,M,610358.0,M,1061634,2020-05-25 02:44:56.991446,V,14,t,CE,engest,"Selfs so, Here God, deur U Gees, laat ons U waarheid hoor."


In [None]:
#hide
# print(df_V['be_top'].unique())
# df_V[df_V['be_top'].isna()] #are these due to a bug???
# df_V[df_V['e_top'].isna()]

In [None]:
df_V = df_V.drop(['t_senc','s_typ','be_id','c_eis'], axis=1)
df_V.iloc[:5,:-2]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role,u_name
0,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,844713,2020-01-15 02:13:34.847562,V,a,CE,engest
1,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,256723,2018-04-23 11:04:31.787641,V,a,TE,linoli
2,1965-0418x,AFR,18-0101-B123E1R,2,339948,T,N,468379,2019-01-30 22:21:29.62162,V,c,CE,engest
3,1965-0418x,AFR,18-0101-B123E1R,2,200635,N,N,256725,2018-04-23 11:23:43.781013,V,c,TE,linoli
4,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,N,256727,2018-04-23 11:26:37.965897,V,c,TE,linoli


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role,u_name,e_content
0,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,844713,2020-01-15 02:13:34.847562,V,a,CE,engest,Laat ons ons hoofde buig.
1,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,256723,2018-04-23 11:04:31.787641,V,a,TE,linoli,Laat ons ons hoofde buig.
2,1965-0418x,AFR,18-0101-B123E1R,2,339948,T,N,468379,2019-01-30 22:21:29.62162,V,c,CE,engest,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Paasfees is."
3,1965-0418x,AFR,18-0101-B123E1R,2,200635,N,N,256725,2018-04-23 11:23:43.781013,V,c,TE,linoli,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Pase is."
4,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,N,256727,2018-04-23 11:26:37.965897,V,c,TE,linoli,"Ons glo dat U Jesus opgewek het uit die dode, baie jare gelede, en vandag vier ons hierdie aandenking."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382737,CAB-06,AFR,18-1101-B123,833,610356,M,M,1061632,2020-05-25 02:44:38.590448,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
382738,CAB-06,AFR,18-1101-B123,834,610357,M,M,1061633,2020-05-25 02:44:42.21845,V,t,CE,engest,Amen.
382739,CAB-06,AFR,18-1101-B123,834,610357,M,N,874368,2020-02-02 00:24:14.79957,V,c,TE,tilvan,Amen.
382740,CAB-06,AFR,18-1101-B123,835,610358,M,M,1061634,2020-05-25 02:44:56.991446,V,t,CE,engest,"Selfs so, Here God, deur U Gees, laat ons U waarheid hoor."


In [None]:
#hide
#verify all TE/CE
# df_V[df_V['a_role']=='TE']
# df_V[df_V['a_role']=='CE']

In [None]:
#hide
#keep only top edits
# df_V['e_top'].unique()
df_V[~df_V['e_top'].isin(['M','T'])] #show others first before reassigning df

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role,u_name,e_content
3,1965-0418x,AFR,18-0101-B123E1R,2,200635,N,N,256725,2018-04-23 11:23:43.781013,V,c,TE,linoli,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Pase is."
7,1965-0418x,AFR,18-0101-B123E1R,4,200638,N,N,256730,2018-04-23 11:31:35.950348,V,c,TE,linoli,"En laat daar tussen ons almal ,'n Pase kom vandag."
8,1965-0418x,AFR,18-0101-B123E1R,5,200639,N,N,256732,2018-04-23 11:43:02.365621,V,c,TE,linoli,"Mag ons, as Sy diensknegte, Sy Woord verstaan, dat ons in Sy gemeenskap was, en dat ons nou saam met Hom opgestaan het en saam sit in Hemelse plekke."
14,1965-0418x,AFR,18-0101-B123E1R,8,200640,N,N,256736,2018-04-23 11:45:23.928111,V,c,TE,linoli,"Mag dit ook ’n Pase wees vir hulle, en ’n eksodus van siekte tot krag."
23,1965-0418x,AFR,18-0101-B123E1R,12,200641,N,N,256741,2018-04-23 11:52:51.171064,V,c,TE,linoli,"Ek beskou hierdie beslis as ’n wonderlike voorreg, vanmôre om terug te wees hier in Jeffersonville, Indiana, met hierdie groot gemeente, die kerk gepak en staan in en om buitekant, in die parkeer area en oral."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382705,CAB-06,AFR,18-1101-B123,817,610338,N,N,874332,2020-02-02 00:04:25.814695,V,c,TE,tilvan,Hulle word ook sterre genoem omdat hulle 'houers' van lig is in die aandtyd.
382706,CAB-06,AFR,18-1101-B123,818,610339,N,N,874334,2020-02-02 00:04:55.254216,V,c,TE,tilvan,"Daarom in die donker van sonde, bring hulle die lig van God na Sy mense."
382724,CAB-06,AFR,18-1101-B123,827,610348,N,N,874352,2020-02-02 00:16:15.338023,V,c,TE,tilvan,Hy sal nie die sterre gebruik (boodskappers) om lig te gee in duisternis nie.
382728,CAB-06,AFR,18-1101-B123,829,610350,N,N,874356,2020-02-02 00:18:14.532886,V,c,TE,tilvan,Dit is die môrester wat sigbaar is wanneer die lig van die son begin skyn.


In [None]:
# 
#keep only top edits
# df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V = df_V[df_V['e_top'].isin(['M','T'])] #Majority & Tie
df_V.iloc[:5,:-2]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role
0,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,844713,2020-01-15 02:13:34.847562,V,a,CE
1,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,256723,2018-04-23 11:04:31.787641,V,a,TE
2,1965-0418x,AFR,18-0101-B123E1R,2,339948,T,N,468379,2019-01-30 22:21:29.62162,V,c,CE
4,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,N,256727,2018-04-23 11:26:37.965897,V,c,TE
5,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,M,468380,2019-01-30 22:21:51.780404,V,t,CE
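
The filter keeps rows whose `e_top` is `M` (majority) or `T` (tie). The sketch below repeats it on a made-up frame and keeps the boolean mask in a variable, so the dropped rows can be inspected with the inverted mask before the frame is reassigned.

```python
import pandas as pd

# Made-up votes: sentence 2 has one tied and one non-top edit.
votes = pd.DataFrame({
    "s_rsen": [1, 2, 2, 3],
    "e_top": ["M", "T", "N", "M"],
})

is_top = votes["e_top"].isin(["M", "T"])  # Majority & Tie

top_votes = votes[is_top].copy()  # copy avoids chained-assignment warnings later
dropped = votes[~is_top]

print(top_votes)
print(dropped)
```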


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role,u_name,e_content
0,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,844713,2020-01-15 02:13:34.847562,V,a,CE,engest,Laat ons ons hoofde buig.
1,1965-0418x,AFR,18-0101-B123E1R,1,181444,M,M,256723,2018-04-23 11:04:31.787641,V,a,TE,linoli,Laat ons ons hoofde buig.
2,1965-0418x,AFR,18-0101-B123E1R,2,339948,T,N,468379,2019-01-30 22:21:29.62162,V,c,CE,engest,"Here, soos ons hier vergader op hierdie mooi Paasfees oggend, sien die botsels uitloop, die bye wat in vlieg en hulle gedeelte kry, die voëls wat sing asof hulle harte wil bars van vreugde, omdat daar 'n Paasfees is."
4,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,N,256727,2018-04-23 11:26:37.965897,V,c,TE,linoli,"Ons glo dat U Jesus opgewek het uit die dode, baie jare gelede, en vandag vier ons hierdie aandenking."
5,1965-0418x,AFR,18-0101-B123E1R,3,200637,M,M,468380,2019-01-30 22:21:51.780404,V,t,CE,engest,"Ons glo dat U Jesus opgewek het uit die dode, baie jare gelede, en vandag vier ons hierdie aandenking."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382737,CAB-06,AFR,18-1101-B123,833,610356,M,M,1061632,2020-05-25 02:44:38.590448,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
382738,CAB-06,AFR,18-1101-B123,834,610357,M,M,1061633,2020-05-25 02:44:42.21845,V,t,CE,engest,Amen.
382739,CAB-06,AFR,18-1101-B123,834,610357,M,N,874368,2020-02-02 00:24:14.79957,V,c,TE,tilvan,Amen.
382740,CAB-06,AFR,18-1101-B123,835,610358,M,M,1061634,2020-05-25 02:44:56.991446,V,t,CE,engest,"Selfs so, Here God, deur U Gees, laat ons U waarheid hoor."


In [None]:
tmp = df_V.sort_values(by=['m_descriptor', 't_lan','t_version','s_rsen','c_created_at'])

In [None]:
#hide
tmp

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_id,e_top,be_top,c_id,c_created_at,c_kind,c_base,a_role,u_name,e_content
6405,1948-0304,GER,15-0902-B123,1,464917,M,,662736,2019-07-30 13:44:14.904495,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung."""
6406,1948-0304,GER,15-0902-B123,2,456140,M,M,662737,2019-07-30 13:44:34.151158,V,a,TE,hugmes,Ich weiß nicht; niemand tut.
6407,1948-0304,GER,15-0902-B123,3,456141,M,M,662738,2019-07-30 13:44:45.966924,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.
6408,1948-0304,GER,15-0902-B123,4,456142,M,M,662739,2019-07-30 13:44:53.096079,V,a,TE,hugmes,Ich weiß es nicht.
6409,1948-0304,GER,15-0902-B123,5,457312,M,N,662741,2019-07-30 13:45:56.611909,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
382737,CAB-06,AFR,18-1101-B123,833,610356,M,M,1061632,2020-05-25 02:44:38.590448,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
382739,CAB-06,AFR,18-1101-B123,834,610357,M,N,874368,2020-02-02 00:24:14.79957,V,c,TE,tilvan,Amen.
382738,CAB-06,AFR,18-1101-B123,834,610357,M,M,1061633,2020-05-25 02:44:42.21845,V,t,CE,engest,Amen.
382741,CAB-06,AFR,18-1101-B123,835,610358,M,N,874370,2020-02-02 00:24:42.429672,V,c,TE,tilvan,"Selfs so, Here God, deur U Gees, laat ons U waarheid hoor."


In [None]:
#hide
del tmp

In [None]:
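# Note: 'last' takes the final row of each group in df_V's current row order.
# The sorted copy (tmp) above is not used here, so this is not necessarily the most recent vote by c_created_at.
df_V = df_V.groupby(['m_descriptor', 't_lan','t_version','s_rsen']).agg({'e_top':'last', 'be_top':'last', 'c_created_at':['last','count'], 'c_kind':'last', 'c_base':'last', 'a_role':'last', 'u_name':'last', 'e_content':'last'})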
df_V.iloc[:,:-2]

m_descriptor,t_lan,t_version,s_rsen,e_top (last),be_top (last),c_created_at (last),c_created_at (count),c_kind (last),c_base (last),a_role (last)
1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE
1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE
1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE
1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE
1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE
...,...,...,...,...,...,...,...,...,...,...
CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE
CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE
CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE
CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE
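
Because `'last'` picks the final row of each group in whatever order the rows currently have, a variant that sorts by `c_created_at` first would return the most recent vote per sentence. The sketch below shows that variant on a made-up frame, using named aggregation to get flat column names. The timestamps are shortened and the output names `last_created` and `n_votes` are assumptions for this example.

```python
import pandas as pd

keys = ["m_descriptor", "t_lan", "t_version", "s_rsen"]

# Made-up votes: for sentence 834 the later vote appears first in the frame.
votes = pd.DataFrame({
    "m_descriptor": ["CAB-06"] * 3,
    "t_lan": ["AFR"] * 3,
    "t_version": ["18-1101-B123"] * 3,
    "s_rsen": [834, 834, 835],
    "c_created_at": ["2020-05-25 02:44:42", "2020-02-02 00:24:14", "2020-05-25 02:44:56"],
    "a_role": ["CE", "TE", "CE"],
})
votes["c_created_at"] = pd.to_datetime(votes["c_created_at"])

latest = (
    votes.sort_values(keys + ["c_created_at"])  # oldest to newest within each sentence
    .groupby(keys)
    .agg(
        last_created=("c_created_at", "last"),
        n_votes=("c_created_at", "count"),
        a_role=("a_role", "last"),
    )
    .reset_index()
)
print(latest)
```

Whether the most recent vote or the file-order last vote is the right choice depends on how the result is used downstream.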


In [None]:
#hide
df_V

m_descriptor,t_lan,t_version,s_rsen,e_top (last),be_top (last),c_created_at (last),c_created_at (count),c_kind (last),c_base (last),a_role (last),u_name (last),e_content (last)
1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung."""
1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.
1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.
1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.
1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden."
...,...,...,...,...,...,...,...,...,...,...,...,...
CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?
CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?"
CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.


In [None]:
# 
# use T-contributions.csv to verify that all sentences have votes (i.e. no red ones left)

In [None]:
#hide
# df_T = pd.read_csv(f'{PATH}/contributions/1955-0807y_Pride_CHN_15-0401-h_T-contributions.csv', sep='~')
# df_T = pd.read_csv(f'{PATH}/contributions/1965-0829_SatansEden_FIJ_15-1104-B123_T-contributions.csv', sep='~')
# df_T

In [None]:
all_files = glob.glob(f"{PATH}/contributions/*T-contributions.csv")
li = []
for filename in all_files:
    dft = pd.read_csv(filename, index_col=None, header=0, sep='~')
    li.append(dft)
df_T = pd.concat(li, axis=0, ignore_index=True)
df_T.iloc[:5,:-2]

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role
0,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,,,231225,2018-03-30 13:05:50.489319,T,0,,MT
1,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,181445,N,,,231226,2018-03-30 13:05:50.524888,T,0,,MT
2,1965-0418x,AFR,1870,18-0101-B123E1R,n,3,181446,N,,,231227,2018-03-30 13:05:50.56683,T,0,,MT
3,1965-0418x,AFR,1870,18-0101-B123E1R,n,4,181447,N,,,231228,2018-03-30 13:05:50.601543,T,0,,MT
4,1965-0418x,AFR,1870,18-0101-B123E1R,n,5,181448,N,,,231229,2018-03-30 13:05:50.635029,T,0,,MT


In [None]:
#hide
df_T

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role,u_name,e_content
0,1965-0418x,AFR,1870,18-0101-B123E1R,n,1,181444,M,,,231225,2018-03-30 13:05:50.489319,T,0,,MT,todmo-2.0.0,Laat ons ons hoofde buig.
1,1965-0418x,AFR,1870,18-0101-B123E1R,n,2,181445,N,,,231226,2018-03-30 13:05:50.524888,T,0,,MT,todmo-2.0.0,"Here, soos ons hier vergader hierdie mooi Paasoggend, sien die knoppies druk op hulle manier uit, die bye vlieg in en kry hulle gedeelte, die voëls het gesing soos hulle harte bars met vreugde, want daar is ’n Pase."
2,1965-0418x,AFR,1870,18-0101-B123E1R,n,3,181446,N,,,231227,2018-03-30 13:05:50.56683,T,0,,MT,todmo-2.0.0,"Ons glo dat U Jesus opgewek het uit die dood, baie jare gelede, vandag, en ons vier hierdie aandenking dag."
3,1965-0418x,AFR,1870,18-0101-B123E1R,n,4,181447,N,,,231228,2018-03-30 13:05:50.601543,T,0,,MT,todmo-2.0.0,"En laat daar kom ’n Pase tussen ons almal, vandag."
4,1965-0418x,AFR,1870,18-0101-B123E1R,n,5,181448,N,,,231229,2018-03-30 13:05:50.635029,T,0,,MT,todmo-2.0.0,"Mag ons, as Sy diensknegte, verstaan, dat ons Sy Woord was in Sy gemeenskap dan, en dit wat ons nou opgestaan saam met Hom en saam sit in Hemelse plekke."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,835,18-1101-B123,n,831,494409,N,,,708583,2019-08-31 17:18:46.983416,T,0,,MT,todnt,Waarskuwings wat anders kon ons bokant Jesus Self begeer?
248385,CAB-06,AFR,835,18-1101-B123,n,832,494410,N,,,708584,2019-08-31 17:18:46.998927,T,0,,MT,todnt,"Sagmoedig Is Hy nie alles nie, selfs Volmaak Alles?"
248386,CAB-06,AFR,835,18-1101-B123,n,833,494411,N,,,708585,2019-08-31 17:18:47.017601,T,0,,MT,todnt,Sagmoedig Hy wat 'n oor laat hoor het wat die Gees aan die gemeentes sê.
248387,CAB-06,AFR,835,18-1101-B123,n,834,494412,N,,,708586,2019-08-31 17:18:47.034374,T,0,,MT,todnt,Vice heiliges.


In [None]:
assert len(df_V) == len(df_T), f"df_V ({len(df_V)} rows) and df_T ({len(df_T)} rows) differ in length: some sentences may have no votes (red ones), i.e. contributions with df_T['e_top'] == 'Z'."
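
If the length check fails, it only says that the two frames differ, not where. A minimal sketch of one way to locate the sentences without votes is an outer merge with `indicator=True`. It assumes the key columns are ordinary, flat columns in both frames and that `m_descriptor`, `t_lan` and `s_rsen` together identify a sentence; the toy frames below are hypothetical, not the real data.

```python
import pandas as pd

# Toy stand-ins for df_T and df_V (hypothetical values).
toy_T = pd.DataFrame({
    'm_descriptor': ['A', 'A', 'A'],
    't_lan': ['AFR', 'AFR', 'AFR'],
    's_rsen': [1, 2, 3],
})
toy_V = pd.DataFrame({
    'm_descriptor': ['A', 'A'],
    't_lan': ['AFR', 'AFR'],
    's_rsen': [1, 3],
})

keys = ['m_descriptor', 't_lan', 's_rsen']

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
diff = toy_T.merge(toy_V, how='outer', on=keys, indicator=True)

# Rows present in the T frame but absent from the V frame have no vote yet.
missing_votes = diff[diff['_merge'] == 'left_only']
print(missing_votes[keys])
```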

In [None]:
# 
#IF THE PREVIOUS ASSERTION FAILS: check for sentences without votes (e_top == 'Z').
#If any exist, vote for them and run this notebook again.
#This is unusual, because each translation's CE should have voted on (i.e. signed off on) ALL sentences.
df_T[df_T['e_top']=='Z']

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role,u_name,e_content


In [None]:
df_T[~df_T['e_top'].isin(['M','T','Z','N'])]

Unnamed: 0,m_descriptor,t_lan,t_senc,t_version,s_typ,s_rsen,e_id,e_top,be_id,be_top,c_id,c_created_at,c_kind,c_eis,c_base,a_role,u_name,e_content


In [None]:
df_V = df_V.reset_index()
df_V.iloc[:5,:-2]

Unnamed: 0_level_0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at,c_kind,c_base,a_role
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,last,last,last,count,last,last,last
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE


In [None]:
#hide
df_V

Unnamed: 0_level_0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at,c_kind,c_base,a_role,u_name,e_content
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,last,last,last,count,last,last,last,last,last
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung."""
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?
248385,CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?"
248386,CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
248387,CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.


In [None]:
df_V.columns

MultiIndex([('m_descriptor',      ''),
            (       't_lan',      ''),
            (   't_version',      ''),
            (      's_rsen',      ''),
            (       'e_top',  'last'),
            (      'be_top',  'last'),
            ('c_created_at',  'last'),
            ('c_created_at', 'count'),
            (      'c_kind',  'last'),
            (      'c_base',  'last'),
            (      'a_role',  'last'),
            (      'u_name',  'last'),
            (   'e_content',  'last')],
           )

In [None]:
#
#rename columns
df_V.columns = ['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','c_created_at_count','c_kind','c_base','a_role','u_name','e_content']
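
The hard-coded list above has to be kept in sync with the aggregation by hand. As a sketch of an alternative, the two-level header can be flattened programmatically. The toy frame below is hypothetical and is built with a `groupby`/`agg` only to obtain a similar two-level header; the rule "keep the plain name unless the statistic is `count`" is an assumption chosen to reproduce names such as `c_created_at_count`.

```python
import pandas as pd

toy = pd.DataFrame({
    'm_descriptor': ['A', 'A', 'B'],
    'c_created_at': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'e_top': ['N', 'M', 'T'],
})

# Aggregating with lists of functions yields MultiIndex columns.
agg = (
    toy.groupby('m_descriptor')
       .agg({'c_created_at': ['last', 'count'], 'e_top': ['last']})
       .reset_index()
)

def flatten(col):
    """Map ('c_created_at', 'count') to 'c_created_at_count'; otherwise keep the name."""
    name, stat = col
    return f'{name}_{stat}' if stat == 'count' else name

agg.columns = [flatten(c) for c in agg.columns]
print(list(agg.columns))
```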

In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung."""
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?
248385,CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?"
248386,CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.
248387,CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.


In [None]:
#hide
df_V[~df_V['e_top'].isin(['M','T','Z','N'])]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content


In [None]:
#hide
len(df_V[df_V['a_role']=='CE']) + len(df_V[df_V['a_role']=='TE']) + len(df_V[df_V['a_role']=='QE']) + len(df_V[df_V['a_role']=='LA'])

248389

In [None]:
#hide
len( df_V[~df_V['a_role'].isin(['CE','TE','QE'])] ) #seems like votes by LA do NOT show in TODPROOF - bug??

165

In [None]:
#hide
len(df_V[df_V['a_role']=='LA'])
# len(df_V[df_V['a_role']=='EP'])

165
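
The role checks above can also be done in one pass. A minimal sketch on a hypothetical `a_role` series: `value_counts()` gives the count per role, and `isin()` flags any role outside the expected set.

```python
import pandas as pd

# Hypothetical roles, for illustration only.
roles = pd.Series(['TE', 'CE', 'TE', 'LA', 'QE', 'TE'], name='a_role')

# One count per distinct role.
counts = roles.value_counts()
print(counts)

# Any role outside the expected set would show up here.
expected = {'CE', 'TE', 'QE', 'LA'}
unexpected = roles[~roles.isin(expected)]
print(len(unexpected))
```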

In [None]:
#hide
df_V[~df_V['a_role'].isin(['CE','TE','QE','LA'])]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content


In [None]:
# 
#handle NaNs in e_content
e_content_nans = df_V['e_content'].isna()

In [None]:
#hide
df_V[e_content_nans]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content
15724,1953-0405s,CHN,19-0201-h,1102,M,M,2019-06-11 09:12:28.088006,1,V,a,TE,dawnxu,
15775,1953-0405s,CHN,19-0201-h,1153,M,M,2019-06-15 03:25:11.855078,1,V,a,TE,dawnxu,
30871,1955-1118,CHN,15-0401-h,18,M,M,2018-01-06 19:20:40.907983,1,V,a,TE,estzhe,
33361,1956-0805,CHN,15-0402-h,1176,M,M,2019-07-24 00:12:14.971666,1,V,a,TE,corche,
34819,1957-0114,CHN,19-0701-h,1204,M,M,2020-05-13 03:47:44.384779,1,V,a,TE,dawnxu,
35746,1957-0120x,CHN,19-0701-h,360,M,M,2020-02-01 06:08:24.434038,1,V,a,TE,dawnxu,
42964,1957-0419,BEM,15-0401-B123,505,M,M,2019-05-31 07:05:46.896578,2,V,a,CE,marmwa,
55587,1960-0607,CHN,19-0401-h,123,M,M,2019-08-20 13:16:51.415398,1,V,a,TE,dawnxu,
55749,1960-0607,CHN,19-0401-h,285,M,M,2019-08-27 01:45:11.773081,1,V,a,TE,dawnxu,
56098,1960-0607,CHN,19-0401-h,634,M,M,2019-09-01 02:36:08.515518,1,V,a,TE,dawnxu,


In [None]:
# 
#replace e_content NaNs with empty strings
df_V.loc[e_content_nans, 'e_content'] = ''

In [None]:
#hide
# df_V.loc[e_content_nans, ['e_content']]
# OR
df_V[df_V['e_content']=='']

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content
15724,1953-0405s,CHN,19-0201-h,1102,M,M,2019-06-11 09:12:28.088006,1,V,a,TE,dawnxu,
15775,1953-0405s,CHN,19-0201-h,1153,M,M,2019-06-15 03:25:11.855078,1,V,a,TE,dawnxu,
30871,1955-1118,CHN,15-0401-h,18,M,M,2018-01-06 19:20:40.907983,1,V,a,TE,estzhe,
33361,1956-0805,CHN,15-0402-h,1176,M,M,2019-07-24 00:12:14.971666,1,V,a,TE,corche,
34819,1957-0114,CHN,19-0701-h,1204,M,M,2020-05-13 03:47:44.384779,1,V,a,TE,dawnxu,
35746,1957-0120x,CHN,19-0701-h,360,M,M,2020-02-01 06:08:24.434038,1,V,a,TE,dawnxu,
42964,1957-0419,BEM,15-0401-B123,505,M,M,2019-05-31 07:05:46.896578,2,V,a,CE,marmwa,
55587,1960-0607,CHN,19-0401-h,123,M,M,2019-08-20 13:16:51.415398,1,V,a,TE,dawnxu,
55749,1960-0607,CHN,19-0401-h,285,M,M,2019-08-27 01:45:11.773081,1,V,a,TE,dawnxu,
56098,1960-0607,CHN,19-0401-h,634,M,M,2019-09-01 02:36:08.515518,1,V,a,TE,dawnxu,


In [None]:
# 
#add chars column
#e_content NaNs were replaced with '' above; without that, len() raises
#TypeError: object of type 'float' has no len()
df_V['chars'] = [len(e) for e in df_V['e_content']]
# df_V['chars'] = [len(e) if type(e)==str else 1 for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','c_created_at_count','c_kind','c_base','a_role','chars']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,chars
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,112
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,28
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,41
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,18
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,106
5,1948-0304,GER,15-0902-B123,6,M,N,2019-07-30 13:46:34.354399,1,V,c,TE,36
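
The NaN replacement and the character count can also be written with pandas string methods. This is a sketch on a hypothetical three-row frame, not a replacement for the cells above: `fillna('')` turns missing content into empty strings, and `.str.len()` counts characters per row.

```python
import numpy as np
import pandas as pd

# Hypothetical content, including one missing value.
toy = pd.DataFrame({'e_content': ['Amen.', np.nan, 'Ich weiß es nicht.']})

# Missing content counts as zero characters.
toy['chars'] = toy['e_content'].fillna('').str.len()
print(toy)
```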


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content,chars
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung.""",112
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.,28
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.,41
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.,18
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden.",106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?,48
248385,CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?",41
248386,CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.,66
248387,CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.,5


In [None]:
#hide 
# df_V.loc[e_content_nans, ['e_content','chars']]
# OR
df_V[df_V['chars']==0]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content,chars
15724,1953-0405s,CHN,19-0201-h,1102,M,M,2019-06-11 09:12:28.088006,1,V,a,TE,dawnxu,,0
15775,1953-0405s,CHN,19-0201-h,1153,M,M,2019-06-15 03:25:11.855078,1,V,a,TE,dawnxu,,0
30871,1955-1118,CHN,15-0401-h,18,M,M,2018-01-06 19:20:40.907983,1,V,a,TE,estzhe,,0
33361,1956-0805,CHN,15-0402-h,1176,M,M,2019-07-24 00:12:14.971666,1,V,a,TE,corche,,0
34819,1957-0114,CHN,19-0701-h,1204,M,M,2020-05-13 03:47:44.384779,1,V,a,TE,dawnxu,,0
35746,1957-0120x,CHN,19-0701-h,360,M,M,2020-02-01 06:08:24.434038,1,V,a,TE,dawnxu,,0
42964,1957-0419,BEM,15-0401-B123,505,M,M,2019-05-31 07:05:46.896578,2,V,a,CE,marmwa,,0
55587,1960-0607,CHN,19-0401-h,123,M,M,2019-08-20 13:16:51.415398,1,V,a,TE,dawnxu,,0
55749,1960-0607,CHN,19-0401-h,285,M,M,2019-08-27 01:45:11.773081,1,V,a,TE,dawnxu,,0
56098,1960-0607,CHN,19-0401-h,634,M,M,2019-09-01 02:36:08.515518,1,V,a,TE,dawnxu,,0


In [None]:
# 
#add words column
#https://www.geeksforgeeks.org/python-program-to-count-words-in-a-sentence/
df_V['words'] = [len(re.findall(r'\w+', e)) for e in df_V['e_content']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','c_created_at_count','c_kind','c_base','a_role','chars','words']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,chars,words
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,112,20
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,28,5
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,41,7
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,18,4
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,106,18
5,1948-0304,GER,15-0902-B123,6,M,N,2019-07-30 13:46:34.354399,1,V,c,TE,36,8
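
It helps to know what the `\w+` pattern counts as a word. The sketch below wraps the same expression in a hypothetical helper, `count_words`, and applies it to a few short strings. Two behaviours stand out: an apostrophe splits a token in two, and text written without spaces between words (as Chinese typically is) collapses into very few "words". The Chinese string is an illustrative example, not taken from the dataset.

```python
import re

def count_words(text):
    """Count maximal runs of word characters, as in the cell above."""
    return len(re.findall(r'\w+', text))

print(count_words('Ich weiß nicht; niemand tut.'))  # 5
print(count_words("Hy wat 'n oor het"))              # 5: the apostrophe is dropped, 'n' still counts
print(count_words("Don't"))                          # 2: 'Don' and 't'
print(count_words('我不知道'))                        # 1: no spaces, so one run
```

For languages such as CHN the character count is therefore likely to be the more informative of the two measures.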


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content,chars,words
0,1948-0304,GER,15-0902-B123,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung.""",112,20
1,1948-0304,GER,15-0902-B123,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.,28,5
2,1948-0304,GER,15-0902-B123,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.,41,7
3,1948-0304,GER,15-0902-B123,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.,18,4
4,1948-0304,GER,15-0902-B123,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden.",106,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,18-1101-B123,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?,48,8
248385,CAB-06,AFR,18-1101-B123,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?",41,8
248386,CAB-06,AFR,18-1101-B123,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.,66,15
248387,CAB-06,AFR,18-1101-B123,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.,5,1


In [None]:
#hide 
# df_V[df_V['words']==0]

In [None]:
# 
#remove the BER part (the third hyphen-separated field) from t_version, e.g. '15-0902-B123' -> '15-0902'; allows for joining to English contributions
df_V['t_version'] = ['-'.join(e.split('-')[:2]) for e in df_V['t_version']]
df_V.loc[:5,['m_descriptor','t_lan','t_version','s_rsen','e_top','be_top','c_created_at','c_created_at_count','c_kind','c_base','a_role','chars','words']]

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,chars,words
0,1948-0304,GER,15-0902,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,112,20
1,1948-0304,GER,15-0902,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,28,5
2,1948-0304,GER,15-0902,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,41,7
3,1948-0304,GER,15-0902,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,18,4
4,1948-0304,GER,15-0902,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,106,18
5,1948-0304,GER,15-0902,6,M,N,2019-07-30 13:46:34.354399,1,V,c,TE,36,8
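
The same trimming can be expressed with pandas string methods instead of a list comprehension. A sketch on a few version strings of the shapes seen in this notebook: split on `-`, keep the first two fields, and join them again.

```python
import pandas as pd

versions = pd.Series(['15-0902-B123', '18-0101-B123E1R', '19-0201-h'])

# split -> keep the first two fields -> join with '-'
trimmed = versions.str.split('-').str[:2].str.join('-')
print(trimmed.tolist())
```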


In [None]:
#hide
df_V

Unnamed: 0,m_descriptor,t_lan,t_version,s_rsen,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content,chars,words
0,1948-0304,GER,15-0902,1,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung.""",112,20
1,1948-0304,GER,15-0902,2,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.,28,5
2,1948-0304,GER,15-0902,3,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.,41,7
3,1948-0304,GER,15-0902,4,M,M,2019-07-30 13:44:53.096079,1,V,a,TE,hugmes,Ich weiß es nicht.,18,4
4,1948-0304,GER,15-0902,5,M,N,2019-07-30 13:45:56.611909,1,V,c,TE,hugmes,"Aber ich weiß, dass es in der Nähe schrecklich wird, denn die Anzeichen, die Er sagte, würden stattfinden.",106,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,AFR,18-1101,831,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?,48,8
248385,CAB-06,AFR,18-1101,832,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?",41,8
248386,CAB-06,AFR,18-1101,833,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.,66,15
248387,CAB-06,AFR,18-1101,834,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.,5,1


# Merge E and V contributions

In [None]:
df_joind_EV = pd.merge(df_E, df_V, how='inner', on=['m_descriptor', 't_version', 's_rsen'], suffixes=('_E', '_V'), sort=True)
df_joind_EV.loc[:5,['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','e_top','be_top','c_created_at','c_created_at_count','c_kind','c_base','a_role','chars_V','words_V']]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,chars_V,words_V
0,1948-0304,ENG,15-0902,1,660286,98,21,GER,M,,2019-07-30 13:44:14.904495,1,V,c,TE,112,20
1,1948-0304,ENG,15-0902,1,660286,98,21,POR,M,M,2020-05-25 18:11:14.358631,2,V,t,QE,104,18
2,1948-0304,ENG,15-0902,2,660287,27,6,GER,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,28,5
3,1948-0304,ENG,15-0902,2,660287,27,6,POR,M,M,2020-05-25 18:11:30.820954,3,V,a,QE,25,5
4,1948-0304,ENG,15-0902,3,660288,36,7,GER,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,41,7
5,1948-0304,ENG,15-0902,3,660288,36,7,POR,M,M,2020-05-25 18:11:42.224382,2,V,t,QE,32,7
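
One English sentence can match several translations, so the merge is expected to be one-to-many. A minimal sketch, on hypothetical toy frames, of letting pandas check that expectation: `validate='one_to_many'` raises a `MergeError` if the key is duplicated on the English side. That the English keys are unique in the real data is an assumption here.

```python
import pandas as pd

keys = ['m_descriptor', 't_version', 's_rsen']

# Toy stand-ins for df_E and df_V (hypothetical values).
toy_E = pd.DataFrame({
    'm_descriptor': ['A', 'A'],
    't_version': ['15-0902', '15-0902'],
    's_rsen': [1, 2],
    'words': [21, 6],
})
toy_V = pd.DataFrame({
    'm_descriptor': ['A', 'A', 'A'],
    't_version': ['15-0902', '15-0902', '15-0902'],
    's_rsen': [1, 1, 2],
    't_lan': ['GER', 'POR', 'GER'],
    'words': [20, 18, 5],
})

# Overlapping non-key columns receive the _E / _V suffixes.
joined = pd.merge(
    toy_E, toy_V, how='inner', on=keys,
    suffixes=('_E', '_V'), validate='one_to_many',
)
print(joined)
```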


In [None]:
#hide
df_joind_EV
# df_joind_EV[60000:70000]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,c_created_at_count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
0,1948-0304,ENG,15-0902,1,660286,"Don't no one go away from here now, and say, ""Brother Branham said it's ten years to the rapture.""",98,21,GER,M,,2019-07-30 13:44:14.904495,1,V,c,TE,hugmes,"Gehen Sie jetzt nicht von hier weg und sagen Sie: „Bruder Branham sagte, es sind zehn Jahre bis zur Entrückung.""",112,20
1,1948-0304,ENG,15-0902,1,660286,"Don't no one go away from here now, and say, ""Brother Branham said it's ten years to the rapture.""",98,21,POR,M,M,2020-05-25 18:11:14.358631,2,V,t,QE,kobes2,"Ninguém vá embora daqui agora, e diga: ""O irmão Branham disse que faltam dez anos para o arrebatamento.""",104,18
2,1948-0304,ENG,15-0902,2,660287,I do not know; nobody does.,27,6,GER,M,M,2019-07-30 13:44:34.151158,1,V,a,TE,hugmes,Ich weiß nicht; niemand tut.,28,5
3,1948-0304,ENG,15-0902,2,660287,I do not know; nobody does.,27,6,POR,M,M,2020-05-25 18:11:30.820954,3,V,a,QE,kobes2,Eu não sei; ninguém sabe.,25,5
4,1948-0304,ENG,15-0902,3,660288,Not even the Angels of heaven knows.,36,7,GER,M,M,2019-07-30 13:44:45.966924,1,V,a,TE,hugmes,Nicht einmal die Engel des Himmels kennt.,41,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248384,CAB-06,ENG,18-1101,831,707747,What else could we desire above Jesus Himself?,46,8,AFR,T,N,2020-05-25 02:43:59.956802,1,V,c,CE,engest,Wat anders kan ons begeer behalwe Jesus Homself?,48,8
248385,CAB-06,ENG,18-1101,832,707748,"Is He not everything, even Perfect Everything?",46,7,AFR,M,M,2020-05-25 02:44:26.239919,2,V,t,CE,engest,"Is Hy nie alles nie, selfs Volmaak Alles?",41,8
248386,CAB-06,ENG,18-1101,833,707749,He that hath an ear let him hear what the Spirit saith to the churches.,71,15,AFR,M,M,2020-05-25 02:44:38.590448,2,V,t,CE,engest,Hy wat 'n oor het laat hom hoor wat die Gees aan die gemeentes sê.,66,15
248387,CAB-06,ENG,18-1101,834,707750,Amen.,5,1,AFR,M,N,2020-02-02 00:24:14.79957,2,V,c,TE,tilvan,Amen.,5,1


# Save prepared data to file

In [None]:
df_joind_EV.to_csv(f'{PATH}/PredictTranslationWordAndCharCount_1-output.csv', sep='~', index=False, header=True)
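
When the file is read back, the same separator has to be passed, and empty strings written by `to_csv` come back as NaN by default. The sketch below shows this on a hypothetical frame, using an in-memory buffer instead of the real output path.

```python
import io

import pandas as pd

# Hypothetical rows; the second translation is an empty string.
toy = pd.DataFrame({
    'e_content_E': ['Amen.', 'I do not know; nobody does.'],
    'e_content_V': ['Amen.', ''],
    'words_V': [1, 0],
})

buffer = io.StringIO()
toy.to_csv(buffer, sep='~', index=False, header=True)
buffer.seek(0)

# Same separator as on write; empty fields are parsed as NaN by default.
restored = pd.read_csv(buffer, sep='~')
print(restored['e_content_V'].isna().tolist())

# Restore the empty strings if later code expects str values.
restored['e_content_V'] = restored['e_content_V'].fillna('')
```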