# "Translation Word/Char Count Prediction (Part 3c)"

> Languages IBO, IND, LIN, LUA, LUG, POB
- toc: true
- branch: master
- badges: false
- comments: true
- hide: false
- search_exclude: true
- image: images/PredictTranslationWordAndCharCount_3.png
- categories: [Translation_Industry, Regression, Python, pandas, plotly]
- show_tags: true

In [None]:
#hide
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'TODANALYTICS/'
# base_dir = ""

Mounted at /content/gdrive


## Purpose
There is a relationship between the number of words (and the number of characters) in the source language and the target language. If this relationship can be established and captured in yet another model, such a model can be helpful in at least two ways:

* For training: Validate the alignment of two sentences (in a training example) by comparing their *word size* and/or *character size*
* For inference: Validate the *word size* and/or *character size* of a translated/proofread sentence

In this notebook we continue to discover a model for each language and to evaluate its use in the roles above.
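
The two validation roles above can be sketched as a simple length check: predict the target length from the source length with a fitted line, then accept the pair only if the observed length is close to the prediction. This is a minimal sketch, not the notebook's method. The function names, the placeholder coefficients, and the `rel_tol` tolerance are all assumptions made for illustration.

```python
def expected_length(source_len, intercept, slope):
    """Predicted target length from a fitted line: intercept + slope * source_len."""
    return intercept + slope * source_len


def is_length_plausible(source_len, target_len, intercept, slope, rel_tol=0.35):
    """Return True when target_len lies within rel_tol of the predicted length.

    rel_tol is a hypothetical tolerance, not a value taken from this notebook.
    """
    predicted = expected_length(source_len, intercept, slope)
    return abs(target_len - predicted) <= rel_tol * max(predicted, 1.0)


# Placeholder coefficients for illustration only (not fitted values).
INTERCEPT, SLOPE = 2.0, 1.0

print(is_length_plausible(100, 105, INTERCEPT, SLOPE))  # close to the prediction
print(is_length_plausible(100, 40, INTERCEPT, SLOPE))   # much shorter than predicted
```

The same check works for word counts or character counts. Only the coefficients change.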

## Dataset and Variables

The dataset used in this notebook contains the following features:

* m_descriptor: Unique identifier of a document
* t_lan_E: Language of the translation (English is also considered a translation)
* t_version: Version of a translation
* s_rsen: Number of a sentence within a document
* c_id: Database primary key of a contribution
* e_content_E: Text content of an English contribution
* chars_E: Number of characters in an English contribution
* words_E: Number of words in an English contribution
* t_lan_V: Language of the translation
* e_top: N/A
* be_top: N/A
* c_created_at: Creation time of a contribution
* c_kind: Kind of a contribution
* c_base: N/A
* a_role: N/A
* u_name: N/A
* e_content_V: Text content of a translated contribution
* chars_V: Number of characters in a translated contribution
* words_V: Number of words in a translated contribution

## Set Up the Environment

In [None]:
from pathlib import Path
import pandas as pd
import plotly.express as px
%matplotlib inline

In [None]:
!python --version

Python 3.6.9


In [None]:
# PATH='./'
PATH = Path(base_dir)  # project folder on the mounted drive

## Language IBO

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IBO-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1965-1206,ENG,14-0901,1,3195,20,5,IBO,22,5
1,1965-1206,ENG,14-0901,2,3196,82,14,IBO,80,19
2,1965-1206,ENG,14-0901,3,3197,60,10,IBO,64,14
3,1965-1206,ENG,14-0901,4,3198,95,18,IBO,84,19
4,1965-1206,ENG,14-0901,5,3199,80,17,IBO,80,16
5,1965-1206,ENG,14-0901,6,3200,115,23,IBO,101,25


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# Inspect a suspected outlier (long text columns are truncated for display)
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-1206') & (df['s_rsen']==1120)]
outdf

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
1119,1965-1206,ENG,14-0901,1120,4314,"Look, they laugh...",139,22,IBO,M,,2019-02-05 08:31...,1,V,c,TE,vicadi,"Lee, ha chìri ya...",206,47


In [None]:
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

Look, they laughed at him, called him “a wild, screaming, unlearned fanatic,” as usual, that prophet forerunning the first coming of Jesus.
Lee, ha chìri ya ọchì, kpọọ ya onye n'emebiga ihe oke bụ "mmadụ-ọhia, nke n'eti sọ mkpu, n'enweghi mmụta, onye ihe n'anụ-ọkụ n'obi", dika ọ na adi, na onye amụma ahụ bụ onye mbu-uzọ n'obibia Kraist nke mbu.


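The translation is far longer than the English source: 206 characters against 139, and 47 words against 22. This data point could be an outlier, so it is dropped before fitting.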

In [None]:
df = df.drop(1119)
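
The outlier above was picked by eye from the scatter plot. As a possible complement, candidates could be flagged automatically by fitting a line and looking for unusually large residuals. The sketch below is an assumption-laden illustration in plain Python. The helper names, the `z_threshold` of 3, and the toy data are hypothetical and not part of the original workflow.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Closed-form simple least squares; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def flag_outliers(xs, ys, z_threshold=3.0):
    """Return positions whose residual exceeds z_threshold residual std devs."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > z_threshold * spread]


# Toy data: 20 sentence pairs of roughly equal length, one far too long.
xs = list(range(10, 210, 10))
ys = [x + (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]
ys[7] += 80  # inject one over-long "translation"

print(flag_outliers(xs, ys))
```

Flagged rows would still need a manual look, as done above, because a long translation is not necessarily a wrong one.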

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()





0,1,2,3
Dep. Variable:,y,R-squared:,0.946
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,30120.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:00,Log-Likelihood:,-6637.9
No. Observations:,1710,AIC:,13280.0
Df Residuals:,1708,BIC:,13290.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5145,0.464,5.416,0.000,1.604,3.425
x1,0.9917,0.006,173.539,0.000,0.981,1.003

0,1,2,3
Omnibus:,140.319,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,598.331
Skew:,0.275,Prob(JB):,1.19e-130
Kurtosis:,5.845,Cond. No.,133.0
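
As a worked example of reading the summary above, the `const` and `x1` coefficients can be plugged into a line to get the expected number of IBO characters for a given number of English characters. The coefficients and the sentence lengths below are copied from this notebook. The function name is a hypothetical helper.

```python
# Coefficients copied from the IBO character summary (const and x1).
CONST, SLOPE = 2.5145, 0.9917


def predict_chars(chars_e):
    """Expected translated character count for chars_e English characters."""
    return CONST + SLOPE * chars_e


# First preview row of the IBO data: 20 English characters, 22 translated.
print(round(predict_chars(20), 1))

# The sentence dropped above had 139 English and 206 translated characters.
residual = 206 - predict_chars(139)
print(round(residual, 1))
```

The residual of roughly 66 characters for the dropped sentence gives a sense of how far it sat from the fitted line.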


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.924
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,20620.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:05,Log-Likelihood:,-4367.1
No. Observations:,1710,AIC:,8738.0
Df Residuals:,1708,BIC:,8749.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.4983,0.125,3.991,0.000,0.253,0.743
x1,1.1108,0.008,143.598,0.000,1.096,1.126

0,1,2,3
Omnibus:,175.297,Durbin-Watson:,1.969
Prob(Omnibus):,0.0,Jarque-Bera (JB):,595.211
Skew:,0.485,Prob(JB):,5.64e-130
Kurtosis:,5.722,Cond. No.,26.8
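
To make the columns of the summary tables less opaque, here is a minimal sketch of how `coef`, `std err`, `t`, and `R-squared` come out of a simple least-squares fit. It uses only the six preview rows of the IBO data shown earlier (`words_E` and `words_V`), so its numbers will differ from the full-data summary. The function name and dictionary keys are assumptions.

```python
from math import sqrt


def ols_summary(xs, ys):
    """Simple linear regression by hand: coefficients, R-squared, slope t-value."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares give R-squared.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    # Standard error of the slope uses n - 2 residual degrees of freedom.
    slope_se = sqrt(sse / (n - 2) / sxx)
    return {
        "const": intercept,
        "x1": slope,
        "r_squared": 1 - sse / sst,
        "x1_std_err": slope_se,
        "x1_t": slope / slope_se,
    }


# Six preview rows of the IBO data.
words_e = [5, 14, 10, 18, 17, 23]
words_v = [5, 19, 14, 19, 16, 25]

for name, value in ols_summary(words_e, words_v).items():
    print(f"{name}: {value:.4f}")
```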


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IBO-output.csv', sep='~', index=False, header=True)

## Language IND

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-IND-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1961-0218,ENG,19-0201,1,492254,14,3,IND,16,3
1,1961-0218,ENG,19-0201,2,492255,11,2,IND,23,2
2,1961-0218,ENG,19-0201,3,492256,47,9,IND,45,7
3,1961-0218,ENG,19-0201,4,492257,60,13,IND,78,11
4,1961-0218,ENG,19-0201,5,492258,35,9,IND,38,7
5,1961-0218,ENG,19-0201,6,492259,29,5,IND,38,5


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# #
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

In [None]:
# pd.set_option('display.max_colwidth',1000)
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.96
Method:,Least Squares,F-statistic:,68860.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:22,Log-Likelihood:,-11249.0
No. Observations:,2861,AIC:,22500.0
Df Residuals:,2859,BIC:,22510.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.5565,0.371,6.891,0.000,1.829,3.284
x1,1.2133,0.005,262.403,0.000,1.204,1.222

0,1,2,3
Omnibus:,187.434,Durbin-Watson:,2.009
Prob(Omnibus):,0.0,Jarque-Bera (JB):,712.307
Skew:,0.216,Prob(JB):,2.11e-155
Kurtosis:,5.406,Cond. No.,129.0


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.954
Model:,OLS,Adj. R-squared:,0.954
Method:,Least Squares,F-statistic:,59690.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:25,Log-Likelihood:,-6110.8
No. Observations:,2861,AIC:,12230.0
Df Residuals:,2859,BIC:,12240.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0500,0.063,-0.798,0.425,-0.173,0.073
x1,0.9575,0.004,244.319,0.000,0.950,0.965

0,1,2,3
Omnibus:,283.395,Durbin-Watson:,2.044
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1121.568
Skew:,0.427,Prob(JB):,2.85e-244
Kurtosis:,5.946,Cond. No.,26.2
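
In the IND word summary above, the intercept is not distinguishable from zero (p = 0.425, interval from -0.173 to 0.073). One possible simplification, which this notebook does not apply, is a line through the origin, so that the expected word count is a single ratio times the English word count. The sketch below uses the six IND preview rows only. The function name is hypothetical.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for y = slope * x (no intercept): sum(xy) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Six preview rows of the IND data (words_E, words_V).
words_e = [3, 2, 9, 13, 9, 5]
words_v = [3, 2, 7, 11, 7, 5]

ratio = slope_through_origin(words_e, words_v)
print(round(ratio, 3))
```

On the full data the ratio would be expected to land near the reported `x1` of 0.9575, but that has not been checked here.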


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-IND-output.csv', sep='~', index=False, header=True)

## Language LIN

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LIN-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1960-0626,ENG,19-0201,1,520727,41,7,LIN,53,9
1,1960-0626,ENG,19-0201,2,520728,87,17,LIN,106,17
2,1960-0626,ENG,19-0201,3,520729,56,11,LIN,60,11
3,1960-0626,ENG,19-0201,4,520730,53,10,LIN,75,15
4,1960-0626,ENG,19-0201,5,520731,33,6,LIN,40,7
5,1960-0626,ENG,19-0201,6,520732,99,19,LIN,149,28


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# Inspect a suspected outlier (long text columns are truncated for display)
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1960-0626') & (df['s_rsen']==88)]
outdf

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
87,1960-0626,ENG,19-0201,88,520814,You go all the w...,349,63,LIN,M,N,2020-06-14 00:25...,2,V,c,TE,chaona,Okokita boye kin...,463,79


In [None]:
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

You go all the way down the turnpike, and then about the same distance, or a little farther again, on over through, down by Brother Beeler’s place, and on through that city, and another city, and another city, and another city, to a little church that my grandfather built, a little Methodist church that I preached in twenty-five, thirty years ago.
Okokita boye kino na balabala esika bafutisaka mbongo mpo na koleka na ngambo mosusu, mpe na nsima okotambwisa pene na ntaka ndenge moko, to mwa mosika, okoleka pembeni na esika ya Ndeko Beeler, okokatisa engumba wana, mpe engumba mosusu, mpe engumba mosusu, mpe engumba mosusu lisusu, kino na mwa ndako na losambo moko oyo nkoko na ngai ya mobali atongaká, mwa ndako na losambo moko ya ba-Metodiste, nateya kuna eleki mibu ntuku mibale na mitano to ntuku misato.


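The translation has 463 characters against 349 in the English source (79 words against 63). This data point could be an outlier, so it is dropped before fitting.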

In [None]:
df = df.drop(87)

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.951
Method:,Least Squares,F-statistic:,201800.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:35,Log-Likelihood:,-40862.0
No. Observations:,10386,AIC:,81730.0
Df Residuals:,10384,BIC:,81740.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.8498,0.193,9.567,0.000,1.471,2.229
x1,1.1187,0.002,449.244,0.000,1.114,1.124

0,1,2,3
Omnibus:,1408.398,Durbin-Watson:,1.989
Prob(Omnibus):,0.0,Jarque-Bera (JB):,6942.709
Skew:,0.567,Prob(JB):,0.0
Kurtosis:,6.841,Cond. No.,124.0


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.923
Model:,OLS,Adj. R-squared:,0.923
Method:,Least Squares,F-statistic:,124900.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:38,Log-Likelihood:,-24930.0
No. Observations:,10386,AIC:,49860.0
Df Residuals:,10384,BIC:,49880.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0831,0.042,-1.957,0.050,-0.166,0.000
x1,0.9737,0.003,353.456,0.000,0.968,0.979

0,1,2,3
Omnibus:,1808.822,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9937.347
Skew:,0.727,Prob(JB):,0.0
Kurtosis:,7.566,Cond. No.,25.1
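
The `Jarque-Bera (JB)` entry in the summary above can be reproduced from the reported skew and kurtosis, which helps when reading these diagnostics: the statistic grows with the number of observations, so a large sample with moderately heavy tails already gives a very large value. The sketch below applies the standard formula to the rounded LIN word values. The function name is an assumption.

```python
def jarque_bera(n_obs, skew, kurtosis):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), with K the (non-excess) kurtosis."""
    return n_obs / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Values copied from the LIN word summary: observations, skew, kurtosis.
jb = jarque_bera(10386, 0.727, 7.566)
print(round(jb, 1))
```

Because the inputs are rounded to three decimals, the result is close to, but not exactly, the reported 9937.347.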


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LIN-output.csv', sep='~', index=False, header=True)

## Language LUA

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUA-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1965-0801z,ENG,15-1101,1,134793,36,8,LUA,41,6
1,1965-0801z,ENG,15-1101,2,134794,205,38,LUA,246,42
2,1965-0801z,ENG,15-1101,3,134795,182,35,LUA,218,36
3,1965-0801z,ENG,15-1101,4,134796,26,5,LUA,32,4
4,1965-0801z,ENG,15-1101,5,134797,79,17,LUA,108,16
5,1965-0801z,ENG,15-1101,6,134798,104,21,LUA,132,20


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# #
# #outlier
# pd.set_option('display.max_colwidth',20)
# outdf = df[(df['m_descriptor']=='1965-0822x') & (df['s_rsen']==1)]
# outdf

In [None]:
# pd.set_option('display.max_colwidth',1000)
# print(outdf.loc[:,['e_content_E']].values[0][0])
# print(outdf.loc[:,['e_content_V']].values[0][0])

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.938
Model:,OLS,Adj. R-squared:,0.938
Method:,Least Squares,F-statistic:,18130.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:47,Log-Likelihood:,-4949.4
No. Observations:,1197,AIC:,9903.0
Df Residuals:,1195,BIC:,9913.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.7456,0.707,8.126,0.000,4.358,7.133
x1,1.1005,0.008,134.636,0.000,1.084,1.117

0,1,2,3
Omnibus:,165.434,Durbin-Watson:,2.036
Prob(Omnibus):,0.0,Jarque-Bera (JB):,724.899
Skew:,0.579,Prob(JB):,3.89e-158
Kurtosis:,6.633,Cond. No.,140.0


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.922
Model:,OLS,Adj. R-squared:,0.922
Method:,Least Squares,F-statistic:,14180.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:00:52,Log-Likelihood:,-2909.4
No. Observations:,1197,AIC:,5823.0
Df Residuals:,1195,BIC:,5833.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2803,0.131,2.135,0.033,0.023,0.538
x1,0.9355,0.008,119.095,0.000,0.920,0.951

0,1,2,3
Omnibus:,228.232,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,955.493
Skew:,0.847,Prob(JB):,3.29e-208
Kurtosis:,7.036,Cond. No.,27.6


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUA-output.csv', sep='~', index=False, header=True)

## Language LUG

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-LUG-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1965-0219,ENG,15-1101,1,306022,79,16,LUG,99,16
1,1965-0219,ENG,15-1101,2,306023,144,22,LUG,159,25
2,1965-0219,ENG,15-1101,3,306024,83,14,LUG,63,9
3,1965-0219,ENG,15-1101,4,306025,215,37,LUG,199,29
4,1965-0219,ENG,15-1101,5,306026,47,9,LUG,50,8
5,1965-0219,ENG,15-1101,6,306027,161,28,LUG,153,21


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# Inspect a suspected outlier (long text columns are truncated for display)
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==87)]
outdf

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
86,1965-0219,ENG,15-1101,87,306108,It was the last ...,120,23,LUG,M,N,2018-07-31 09:40...,1,V,c,TE,julmuk,Olukunngaana olw...,58,9


In [None]:
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

It was the last day, of the service that I was to speak at the International Convention of the Full Gospel Business Men.
Olukunngaana olw’ensi yonna olwa Full Gospel Business Men.


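The translation is less than half as long as the English source (58 characters against 120, 9 words against 23), which suggests that part of the sentence was not translated. This data point could be an outlier, so it is dropped before fitting.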

In [None]:
df = df.drop(86)

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.957
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,23820.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:01:04,Log-Likelihood:,-4038.0
No. Observations:,1060,AIC:,8080.0
Df Residuals:,1058,BIC:,8090.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.1711,0.555,2.112,0.035,0.083,2.259
x1,1.0132,0.007,154.346,0.000,1.000,1.026

0,1,2,3
Omnibus:,159.05,Durbin-Watson:,2.009
Prob(Omnibus):,0.0,Jarque-Bera (JB):,880.546
Skew:,0.556,Prob(JB):,6.19e-192
Kurtosis:,7.324,Cond. No.,140.0


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# Inspect a suspected outlier (long text columns are truncated for display)
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1965-0219') & (df['s_rsen']==830)]
outdf

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
829,1965-0219,ENG,15-1101,830,306851,"And remember, Ab...",160,43,LUG,M,N,2018-08-17 13:54...,1,V,c,TE,julmuk,208 Kati jjukira...,205,52


In [None]:
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

And remember, Abraham, his name was *Abram* a few days before that, and Sarah was *Sarra* before that; S-a-r-r-a then S-a-r-a-h, and A-b-r-a-m to A-b-r-a-h-a-m.
208 Kati jjukira, Ibulayimu, erinnya lye yali Ibulaamu emabegako ng’ekyo tekinnabaawo, era ne Saala nga ye Salayi ekyo nga tekinnabaawo; S-a-l-a-y-i kati S-a-a-l-a, ne I-b-u-l-a-a-m-u mu I-b-u-l-a-y-i-m-u.


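The names spelled out letter by letter inflate the word counts: the English count of 43 matches what results when every hyphen-separated letter is counted as a word. The translation also starts with what looks like a paragraph number (208). This data point could be an outlier, so it is dropped before fitting.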

In [None]:
df = df.drop(829)
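
The effect of the spelled-out names on the word count can be checked directly on the English sentence printed above. The sketch below compares a whitespace-only split with a split on whitespace and hyphens. How the original `words_E` column was computed is not stated in this notebook, so the hyphen-splitting rule is an assumption that happens to reproduce the stored value.

```python
import re

# English sentence copied from the output above.
sentence = (
    "And remember, Abraham, his name was *Abram* a few days before that, "
    "and Sarah was *Sarra* before that; S-a-r-r-a then S-a-r-a-h, "
    "and A-b-r-a-m to A-b-r-a-h-a-m."
)

# Tokens separated by whitespace only.
whitespace_words = sentence.split()

# Tokens separated by whitespace or hyphens (assumed counting rule).
hyphen_split_words = [t for t in re.split(r"[\s\-]+", sentence) if t]

print(len(whitespace_words))    # 25
print(len(hyphen_split_words))  # 43
```

The second count equals the `words_E` value of 43 stored for this sentence, while a whitespace split would give 25.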

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.918
Model:,OLS,Adj. R-squared:,0.918
Method:,Least Squares,F-statistic:,11840.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:01:17,Log-Likelihood:,-2401.3
No. Observations:,1059,AIC:,4807.0
Df Residuals:,1057,BIC:,4817.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0049,0.119,0.041,0.967,-0.229,0.238
x1,0.7876,0.007,108.809,0.000,0.773,0.802

0,1,2,3
Omnibus:,132.222,Durbin-Watson:,2.06
Prob(Omnibus):,0.0,Jarque-Bera (JB):,788.495
Skew:,0.391,Prob(JB):,6.03e-172
Kurtosis:,7.154,Cond. No.,27.3


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-LUG-output.csv', sep='~', index=False, header=True)

## Language POB

In [None]:
df = pd.read_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_2-POB-output.csv', sep='~')
df.loc[:5, df.columns.isin(['m_descriptor','t_lan_E','t_version','s_rsen','c_id','chars_E','words_E','t_lan_V','chars_V','words_V'])]

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,chars_E,words_E,t_lan_V,chars_V,words_V
0,1950-0110,ENG,15-0901,1,51236,66,14,POB,73,14
1,1950-0110,ENG,15-0901,2,51237,105,22,POB,113,19
2,1950-0110,ENG,15-0901,3,51238,87,19,POB,89,17
3,1950-0110,ENG,15-0901,4,51239,90,17,POB,106,19
4,1950-0110,ENG,15-0901,5,51240,61,11,POB,64,10
5,1950-0110,ENG,15-0901,6,51241,94,18,POB,96,17


In [None]:
#hide
# df

### Characters

In [None]:
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
# Inspect a suspected outlier (long text columns are truncated for display)
pd.set_option('display.max_colwidth',20)
outdf = df[(df['m_descriptor']=='1963-0412x') & (df['s_rsen']==1283)]
outdf

Unnamed: 0,m_descriptor,t_lan_E,t_version,s_rsen,c_id,e_content_E,chars_E,words_E,t_lan_V,e_top,be_top,c_created_at,count,c_kind,c_base,a_role,u_name,e_content_V,chars_V,words_V
2170,1963-0412x,ENG,15-0402,1283,578647,You ought to go ...,107,20,POB,M,N,2019-07-06 18:10...,1,V,c,CE,calamo,Vocês deveriam v...,162,28


In [None]:
pd.set_option('display.max_colwidth',1000)
print(outdf.loc[:,['e_content_E']].values[0][0])
print(outdf.loc[:,['e_content_V']].values[0][0])

You ought to go back into Africa, the Hottentots, and let them kill an animal and blood theirself all over.
Vocês deveriam voltar para a África, os Hottentots [os pastores nômades indígenas não-bantus da África do Sul], e deixá-los matar um animal e sangue deles mesmos.


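This data point is an outlier due to over-translation: the translator inserted a bracketed explanatory note after *Hottentots* that has no counterpart in the English source.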

In [None]:
df = df.drop(2170)
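
Instead of dropping such rows, bracketed translator notes could in principle be removed before counting, so that the sentence stays in the data. This is only a sketch of that alternative and not what the notebook does. The helper name is hypothetical, and the pattern assumes notes are always enclosed in square brackets and never nested.

```python
import re

# Translated sentence copied from the output above.
translated = (
    "Vocês deveriam voltar para a África, os Hottentots "
    "[os pastores nômades indígenas não-bantus da África do Sul], "
    "e deixá-los matar um animal e sangue deles mesmos."
)


def strip_bracketed_notes(text):
    """Remove [ ... ] notes together with the whitespace directly before them."""
    return re.sub(r"\s*\[[^\]]*\]", "", text)


cleaned = strip_bracketed_notes(translated)
print(cleaned)
print(len(translated), len(cleaned))
```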

In [None]:
#hide
fig = px.scatter(data_frame=df, x='chars_E', y='chars_V', color='t_lan_V', 
                 title='Translation Characters vs English Characters', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.972
Model:,OLS,Adj. R-squared:,0.972
Method:,Least Squares,F-statistic:,93450.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:01:42,Log-Likelihood:,-9313.6
No. Observations:,2706,AIC:,18630.0
Df Residuals:,2704,BIC:,18640.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.9075,0.226,4.020,0.000,0.465,1.350
x1,0.9946,0.003,305.703,0.000,0.988,1.001

0,1,2,3
Omnibus:,257.901,Durbin-Watson:,1.931
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1841.319
Skew:,-0.067,Prob(JB):,0.0
Kurtosis:,7.039,Cond. No.,108.0


### Words

In [None]:
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'})
fig.show()

In [None]:
#hide
fig = px.scatter(data_frame=df, x='words_E', y='words_V', color='t_lan_V', 
                 title='Translation Words vs English Words', 
                 opacity=.5, 
                 hover_data=['m_descriptor','t_lan_V','s_rsen'], 
                 labels={'m_descriptor':'Descriptor','t_lan_V':'','s_rsen':'Sentence No'},
                 trendline="ols", 
                 trendline_color_override='black')
fig.show()
results = px.get_trendline_results(fig)
results.px_fit_results.iloc[0].summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.965
Model:,OLS,Adj. R-squared:,0.965
Method:,Least Squares,F-statistic:,74840.0
Date:,"Thu, 22 Oct 2020",Prob (F-statistic):,0.0
Time:,14:01:48,Log-Likelihood:,-4981.0
No. Observations:,2706,AIC:,9966.0
Df Residuals:,2704,BIC:,9978.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0204,0.046,0.439,0.661,-0.071,0.111
x1,0.9277,0.003,273.572,0.000,0.921,0.934

0,1,2,3
Omnibus:,186.693,Durbin-Watson:,2.018
Prob(Omnibus):,0.0,Jarque-Bera (JB):,898.622
Skew:,-0.074,Prob(JB):,7.36e-196
Kurtosis:,5.819,Cond. No.,21.8


In [None]:
df.to_csv(f'{PATH}/PredictTranslationWordAndCharCount/PredictTranslationWordAndCharCount_3-POB-output.csv', sep='~', index=False, header=True)
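
The coefficients reported in the summaries of this notebook can be collected into a small lookup so that expected lengths are available per language. The `(const, x1)` pairs below are copied from the tables above. The dictionary and function names are assumptions for illustration.

```python
# (const, x1) pairs copied from the OLS summaries in this notebook.
CHAR_FITS = {
    "IBO": (2.5145, 0.9917),
    "IND": (2.5565, 1.2133),
    "LIN": (1.8498, 1.1187),
    "LUA": (5.7456, 1.1005),
    "LUG": (1.1711, 1.0132),
    "POB": (0.9075, 0.9946),
}
WORD_FITS = {
    "IBO": (0.4983, 1.1108),
    "IND": (-0.0500, 0.9575),
    "LIN": (-0.0831, 0.9737),
    "LUA": (0.2803, 0.9355),
    "LUG": (0.0049, 0.7876),
    "POB": (0.0204, 0.9277),
}


def expected_counts(language, chars_e, words_e):
    """Return (expected characters, expected words) for one English sentence."""
    c0, c1 = CHAR_FITS[language]
    w0, w1 = WORD_FITS[language]
    return c0 + c1 * chars_e, w0 + w1 * words_e


# Expected lengths for an English sentence of 100 characters and 20 words.
for lang in sorted(CHAR_FITS):
    chars, words = expected_counts(lang, 100, 20)
    print(f"{lang}: about {chars:.0f} characters, {words:.0f} words")
```

Seen side by side, the word slopes range from 0.7876 (LUG) to 1.1108 (IBO), and the character slopes from 0.9917 (IBO) to 1.2133 (IND).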