# OpenHebrewBible (OHB) CSV Data to SQL Conversion V2

Eliran Wong used data from [ETCBC](https://github.com/ETCBC/bhsa) (Hebrew text BHSA, linguistic annotations, morphology, etc.), [OpenScriptures](https://github.com/openscriptures/morphhb) (Hebrew text WLC, Strong's numbers, morphology, etc.), and [Berean.bible](https://berean.bible) (interlinear translation, Berean Study Bible, etc.) to create a robust data repository called [OpenHebrewBible](https://github.com/eliranwong/OpenHebrewBible), consisting of CSV files that bridge the other three open-source projects.

I will take his compiled data file, [BHSA-with-extended-features.csv](https://github.com/eliranwong/OpenHebrewBible/blob/master/BHSA-with-extended-features.csv.zip), clean it, and convert it into a SQL database that I can use in my Flutter app. He also converted the BHSA TF 4c word data into a SQL DB file, [ETCBC4c.db](https://github.com/eliranwong/ETCBC-recycle/blob/master/sqlite3/ETCBC4c.db.zip). I will load both the CSV and DB files into dataframes and combine the useful data into a new dataframe. Later I will compare the combined dataframe to a BHSA SQL file, which can be downloaded [here](https://www.adambaker.org/bhsa.sqlite). Finally, after testing its data, I will convert the dataframe to a SQL database.

Side note: see the BHSA 4c database generated by James Cuenod for direct DB creation from the TF API. I will be using his method to add clause_atom data.

Added: incorporate STEP's Strong's number data into a dataframe and add it as a table to the database.

**KEY**
- OHB_EXTENDED: [BHSA-with-extended-features.csv](https://github.com/eliranwong/OpenHebrewBible/blob/master/BHSA-with-extended-features.csv.zip)
- OHB_DB: [ETCBC4c.db](https://github.com/eliranwong/ETCBC-recycle/blob/master/sqlite3/ETCBC4c.db.zip)
- BHSA_DB: [bhsa.sqlite](https://www.adambaker.org/bhsa.sqlite)
- BH4C_DB: 4c.db, generated directly from TF API by [James](https://github.com/jcuenod/parabible-data-pipeline/blob/master/hb-bhs-pipe/scripts/create_sql_from_tf.py) (set version to c: A = use('bhsa', hoist=globals(), checkout='local', version='c'))
- TBESH_DB: [TBESH.csv](https://github.com/STEPBible/STEPBible-Data/blob/master/TBESH%20-%20Translators%20Brief%20lexicon%20of%20Extended%20Strongs%20for%20Hebrew%20-%20STEPBible.org%20CC%20BY.txt) (after removing the descriptive text at the top of the document).

### Why use OHB data when BHSA already exists?
The OHB data adds several potentially useful features that aren't present in BHSA:
- Strong's number mapped to each node in the BHS.
- Data to align the BHS text with KJV and BSB translations.
- Poetic divisions (not tested).
- BSB gloss for a more accurate rendering of each word. 

## Imports

In [1]:
# Requirements: run in terminal. Change to 'pip' if on Windows OS. 
"""
pip3 install pandas
pip3 install numpy
pip3 install text-fabric
pip3 install jupyter
"""

In [1]:
import pandas as pd
import sqlite3
import numpy as np
import copy
from tf.app import use
from IPython.display import display, HTML

pd.set_option('display.max_columns', None)

## Constants

In [7]:
# Files
BH4C_DB_PATH = '../data_files/BHSA/4c.db'
BHSA_DB_PATH = '../data_files/BHSA/bhsa.sqlite'
OHB_DB_PATH = '../data_files/OHB/ETCBC4c.db'
OHB_EXTENDED_PATH = '../data_files/OHB/BHSA-with-extended-features.csv'
TBESH_PATH = '../data_files/STEP/TBESH.csv'

In [8]:
# DB Connections and Dataframes
BH4C_DB_CON = sqlite3.connect(BH4C_DB_PATH)
BH4C_DB_DF = pd.read_sql_query("SELECT * FROM word_features", BH4C_DB_CON)
print('BH4C_DB Data loaded')

BHSA_DB_CON = sqlite3.connect(BHSA_DB_PATH)
BHSA_DB_DF = pd.read_sql_query("SELECT * FROM word", BHSA_DB_CON)
print('BHSA_DB Data loaded')

OHB_DB_CON = sqlite3.connect(OHB_DB_PATH)
OHB_DB_DF = pd.read_sql_query("SELECT * FROM data", OHB_DB_CON)
print('OHB_DB Data loaded')

# Set low_memory to False so pandas reads each column in one pass
# and infers a single dtype, which avoids mixed-type warnings.
OHB_EXTENDED_DF = pd.read_csv(OHB_EXTENDED_PATH, sep='\t', low_memory=False)
print('OHB_EXTENDED Data loaded')

TBESH_DF = pd.read_csv(TBESH_PATH, sep='\t', low_memory=False, index_col=False)
print('TBESH_DF Data loaded')

BH4C_DB Data loaded
BHSA_DB Data loaded
OHB_DB Data loaded
OHB_EXTENDED Data loaded
TBESH_DF Data loaded


## The Data

The BHSA_DB file consists of all the TF word data from the BHSA dataset -- every word in the BHS Hebrew Old Testament as a node with dozens of features.

The OHB_DB file consists of some of that data along with KJV verse and chapter alignment. 

The OHB_EXTENDED file consists of some of the BHSA data along with added features. It has 22 feature columns and is tab-separated. All of the values are strings or positive integers.
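
A quick way to check a description like this on a freshly loaded file is to look at its shape, dtypes, and missing-value counts. The sketch below does that on a tiny in-memory sample; the reduced column set and the names `sample_tsv` and `sample_df` are illustrative assumptions, not the real file. With the actual data, the same three checks could be run on `OHB_EXTENDED_DF`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout,
# reduced to four columns for brevity (not the full 22-column file).
sample_tsv = (
    "BHSwordSort\tclauseID\tStrongNumber\tETCBCgloss\n"
    "1\tc1\t\tin\n"
    "2\tc1\tH7225\tbeginning\n"
)
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Number of rows and columns.
n_rows, n_cols = sample_df.shape
# Empty fields are read as NaN, so count them per column.
missing_per_column = sample_df.isna().sum()

print(n_rows, n_cols)
print(sample_df.dtypes)
print(missing_per_column)
```

Empty fields load as NaN rather than as empty strings, which is worth keeping in mind when a cleaning function expects every value to be a string.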

The first rows of each dataframe are shown below.

In [5]:
# BHSA_DB Word Data
display(HTML(BHSA_DB_DF.head(n=3).to_html(index=False)))

_id,freq_lex,freq_occ,g_cons,g_cons_utf8,g_lex,g_lex_utf8,g_nme,g_nme_utf8,g_pfm,g_pfm_utf8,g_prs,g_prs_utf8,g_uvf,g_uvf_utf8,g_vbe,g_vbe_utf8,g_vbs,g_vbs_utf8,g_word,g_word_utf8,gloss,gn,kq_hybrid,kq_hybrid_utf8,language,languageISO,lex,lex0,lex_utf8,lexeme_count,ls,nametype,nme,nu,number,pdp,pfm,prs,prs_gn,prs_nu,prs_ps,ps,qere,qere_trailer,qere_trailer_utf8,qere_utf8,rank_lex,rank_occ,sp,st,suffix_gender,suffix_number,suffix_person,trailer,trailer_utf8,uvf,vbe,vbs,voc_lex,voc_lex_utf8,vs,vt
1,15542,14194,B,ב,B.:-,בְּ,,,,,,,,,,,,,B.:-,בְּ,in,,,,Hebrew,hbo,B,B,ב,0,none,,,,1,prep,,absent,unknown,unknown,unknown,,,,,,3,3,prep,,unknown,unknown,unknown,,,absent,,,B.:,בְּ,,
2,51,45,R>CJT,ראשׁית,R;>CIJT,רֵאשִׁית,/,֜,,,,,,,,,,,R;>CI73JT,רֵאשִׁ֖ית,beginning,f,,,Hebrew,hbo,R>CJT/,R>CJT,ראשׁית,0,none,,,sg,2,subs,,absent,unknown,unknown,unknown,,,,,,706,868,subs,a,unknown,unknown,unknown,,,absent,,,R;>CIJT,רֵאשִׁית,,
3,48,15,BR>,ברא,B.@R@>,בָּרָא,,,,,,,,,[,,,,B.@R@74>,בָּרָ֣א,create,m,,,Hebrew,hbo,BR>[,BR>,ברא,0,none,,absent,sg,3,verb,absent,absent,unknown,unknown,unknown,p3,,,,,745,2341,verb,,unknown,unknown,unknown,,,absent,,absent,BR>,ברא,qal,perf


In [6]:
# OHB_DB Data
display(HTML(OHB_DB_DF.head(n=3).to_html(index=False)))

word_ID,Book,ch_BHS,v_BHS,ch_KJV,v_KJV,manuscript,transliteration,lex_Hebrew,lex_number,gloss_Eng,lang,lang_def,morph_pdp,morph_pdp_def,morph_sp,morph_sp_def,morph_vs,morph_vs_def,morph_vt,morph_vt_def,morph_ps,morph_ps_def,morph_gn,morph_gn_def,morph_nu,morph_nu_def,morph_st,morph_st_def,prs_ps,prs_ps_def,prs_gn,prs_gn_def,prs_nu,prs_nu_def,clause_markers,clause_kind,clause_typ,clause_rela,phrase_markers,phrase_typ,phrase_rela,phrase_det,phrase_function
1,Gen,1,1,1,1,בְּ,bᵊ,בְּ,L70001,in,hbo,Ancient Hebrew,prep,preposition,prep,preposition,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,unknown,unknown,unknown,unknown,unknown,unknown,「,Verbal clauses,x-qatal-X clause,,『,Prepositional phrase,,undetermined,Time reference
2,Gen,1,1,1,1,רֵאשִׁ֖ית,rēšˌîṯ,רֵאשִׁית,L70002,beginning,hbo,Ancient Hebrew,subs,noun,subs,noun,,not applicable,,not applicable,,not applicable,f,feminine,sg,singular,a,absolute,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,』,Prepositional phrase,,undetermined,Time reference
3,Gen,1,1,1,1,בָּרָ֣א,bārˈā,ברא,L70003,create,hbo,Ancient Hebrew,verb,verb,verb,verb,qal,qal,perf,perfect,p3,third person,m,masculine,sg,singular,,not applicable,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,『』,Verbal phrase,,,Predicate


In [7]:
# OHB_EXTENDED Data
display(HTML(OHB_EXTENDED_DF.head(n=3).to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕,〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕,clauseID,clauseKind,clauseType,language,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,poneticTranscription,HebrewLexeme,lexemeID,StrongNumber,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,〔BSBsort＠BSB〕
1,¶,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>בְּ</heb><heb></heb>,<heb>ב</heb><heb></heb>,bĕ,bᵊ,<heb>בְּ</heb>,E70001,,H9003,prep,preposition,in,in,〔1＠In〕
2,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>רֵאשִׁ֖ית</heb><heb> </heb>,<heb>ראשית</heb><heb> </heb>,rēšît,rēšˌîṯ,<heb>רֵאשִׁית</heb>,E70002,H7225,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,〔2＠the beginning〕
3,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>בָּרָ֣א</heb><heb> </heb>,<heb>ברא</heb><heb> </heb>,bārā,bārˈā,<heb>ברא</heb>,E70003,H1254,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,〔4＠created〕


In [5]:
# TBESH Data
display(HTML(TBESH_DF.head(n=3).to_html(index=False)))

strongs,lex,transliteration,morph,gloss,definition
H0001,אָב,av,H:N-M,father,"1) father of an individual<br>2) of God as father of his people<br>3) head or founder of a household, group, family, or clan<br>4) ancestor<br>4a) grandfather, forefathers - of person<br>4b) of people<br>5) originator or patron of a class, profession, or art<br>6) of producer, generator (fig.)<br>7) of benevolence and protection (fig.)<br>8) term of respect and honour<br>9) ruler or chief (spec.)<br>"
H0002,אַב,av,A:N-M,father,1) father<br>
H0003,אֵב,ev,H:N-M,greenery,"1) freshness, fresh green, green shoots, or greenery<br>"


## Cleaning the Data (OHB_EXTENDED)

### Data to clean:
```
- 〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕: remove chars and place ints in new columns
- 〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕: remove chars and place ints in new columns
-  clauseID : remove 'c' prefix and convert to int
-  BHSwordPointed : remove html tags, place word in list - trailer in new list -> new column
-  BHSwordConsonantal : remove html tags, place word in list
-  HebrewLexeme : remove html tags
- 〔BSBsort＠BSB〕: remove chars and place int and string in new columns
```
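
To make the target formats concrete, here is a minimal sketch of two of the conversions listed above, applied to single values taken from the first rows of the file. It uses a regular expression rather than the character-stripping approach used later in this notebook, and it assumes every text cell is exactly two `<heb>` (or `<arc>`) elements back to back. The function names are hypothetical.

```python
import re

# Assumption: a text cell is two same-named elements, word then trailer,
# e.g. <heb>WORD</heb><heb>TRAILER</heb> (or the <arc> equivalent).
TAG_PAIR = re.compile(r"<(heb|arc)>(.*?)</\1><\1>(.*?)</\1>")


def split_word_and_trailer(cell):
    """Return (word, trailer) from a tagged text cell."""
    match = TAG_PAIR.fullmatch(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return match.group(2), match.group(3)


def clause_id_to_int(clause_id):
    """Convert a clause id such as 'c12' to the integer 12."""
    if not clause_id.startswith("c"):
        raise ValueError(f"unexpected clause id: {clause_id!r}")
    return int(clause_id[1:])


word, trailer = split_word_and_trailer("<heb>ברא</heb><heb> </heb>")
print(repr(word), repr(trailer))
print(clause_id_to_int("c1"))
```

Raising on an unexpected format, instead of silently stripping characters, makes it easier to spot cells that do not follow the assumed pattern.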

### 1. Clean Text and Clause Data

**Important note:** certain nodes do not have a text value for *BHSwordPointed* or *BHSwordConsonantal* because of the nature of the Hebrew language. For example, at BHS word nodes 61-62 we have:
```
61  <heb>לָ</heb><heb></heb>     <heb>ל</heb><heb></heb>     <heb>לְ</heb>    H9005   prep    to	
62  <heb></heb><heb></heb>      <heb></heb><heb></heb>      <heb>הַ</heb>    H9009   art     the	〔51＠the〕
```
This is from a clause in Genesis 1:5, with the Hebrew: וַיִּקְרָא אֱלֹהִים לָאוֹר יוֹם

Node 62 (the article) is embedded in the word לָאוֹר: it is marked only by the vowel under the preposition, and the *he* (the) does not appear as a consonant in the text.
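
Such nodes are easy to locate once the tags are removed, because their word text is an empty string while their lexeme is not. The sketch below shows the idea on the two rows quoted above, already stripped of tags; the dataframe name `toy` and the reduced column set are assumptions for illustration.

```python
import pandas as pd

# The two example nodes from above, with the html tags already removed.
toy = pd.DataFrame(
    {
        "BHSwordSort": [61, 62],
        "BHSwordPointed": ["לָ", ""],
        "HebrewLexeme": ["לְ", "הַ"],
    }
)

# Nodes that have a lexeme but no written form in the text.
unwritten = toy[(toy["BHSwordPointed"] == "") & (toy["HebrewLexeme"] != "")]
print(unwritten["BHSwordSort"].tolist())
```

Any downstream code that joins words back into running text should treat these rows as contributing nothing to the surface text.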


##### Visualize pre-cleaned data

In [8]:
# Data before being cleaned.
display(HTML(
    OHB_EXTENDED_DF[
        ["BHSwordPointed", 
        "BHSwordConsonantal", 
        "HebrewLexeme", 
        "clauseID"]
    ].head(n=3).to_html(index=False))
)

BHSwordPointed,BHSwordConsonantal,HebrewLexeme,clauseID
<heb>בְּ</heb><heb></heb>,<heb>ב</heb><heb></heb>,<heb>בְּ</heb>,c1
<heb>רֵאשִׁ֖ית</heb><heb> </heb>,<heb>ראשית</heb><heb> </heb>,<heb>רֵאשִׁית</heb>,c1
<heb>בָּרָ֣א</heb><heb> </heb>,<heb>ברא</heb><heb> </heb>,<heb>ברא</heb>,c1


##### Define functions

In [57]:
# Textual items in between word nodes, including paragraph markers, etc. 
text_extensions = {
    '', '׃', '׃ ׆ ס ', ' ס ', '׃ ׆ ', '׃ ', ' ׀ ',
    ' ', '׃ פ ', ' פ ', '׀ ', '׃ ׆ פ ', '־', '׃ ס '
}

# ---
# Function that takes a column name from the original df and
# returns the cleaned text as two lists: words and trailers.
# Use: BHS pointed and consonantal text, all of which is in a format similar
# to: <heb>הָ</heb><heb></heb>. Be sure to update the df with the return value.
def clean_text(col_name):
    cleaned_text = []
    trailers = []
    # Characters that make up the junk html text.
    remove_items = "/<arc>hebqrQR"
    # Either of these will appear between the word and its trailer.
    separators = ["</heb><heb>", "</arc><arc>"]
    # Iterate over the original dataframe and clean the data.
    for text_data in OHB_EXTENDED_DF[col_name]:
        # Mark the boundary with | so we can split the text data later.
        for sep in separators:
            text_data = text_data.replace(sep, '|')
        # Remove all extra items.
        for char in remove_items:
            text_data = text_data.replace(char, "")

        # Note: I originally stored each split text as a list before
        # appending to cleaned_text, but that caused an error when uploading
        # to SQL because it needed an actual data type (e.g., string).

        # The text before '|' is the Hebrew word and the text after it
        # is the trailer; keep them in separate lists.
        word, trailer = text_data.split('|')[:2]
        cleaned_text.append(word)
        trailers.append(trailer)

    return cleaned_text, trailers


# ---
# Clean the text in the HebrewLexeme column, all of which is
# in a format similar to: <heb>הָ</heb>.
def clean_lexemes(col_name):
    cleaned_text = []
    # Read comments from clean_text()
    remove_items = "/<arc>hebqrQR"
    # Iterate over the original dataframe and clean the data. 
    for text_data in OHB_EXTENDED_DF[col_name]:
        # Remove all extra items.
        for char in remove_items:
            if char in text_data:
                text_data = text_data.replace(char, "")
        # Add the lexeme to the cleaned data.
        cleaned_text.append(text_data)

    return cleaned_text


# ---
# All clause data is of the format: c12. Remove the 'c's
# in the clause data and convert to int type. 
def clean_clauses(col_name):
    cleaned_ids = []
    # Iterate over the original dataframe and clean the data. 
    for clause in OHB_EXTENDED_DF[col_name]:
        cleaned_ids.append(int(clause.strip("c")))
        
    return cleaned_ids

# ---
# All lexemeID data is of the format: E70001. Remove the 'E' prefix,
# convert to int, and subtract 70000 so that E70001 becomes 1.
def clean_lex_ids(col_name):
    cleaned_ids = []
    # Iterate over the original dataframe and clean the data. 
    for lex_id in OHB_EXTENDED_DF[col_name]:
        cleaned_ids.append(int(lex_id.strip("E")) - 70000)
        
    return cleaned_ids
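
Because the text-cleaning functions above strip individual characters rather than whole tags, it can be worth confirming afterwards that no markup survived. The following is a minimal sketch of such a check, assuming the cleaned columns should contain no ASCII letters, angle brackets, or slashes; the helper name `rows_with_leftover_markup` and the sample values are hypothetical.

```python
import pandas as pd

# Assumption: cleaned Hebrew text never contains these characters.
LEFTOVER_PATTERN = r"[A-Za-z<>/]"


def rows_with_leftover_markup(series):
    """Return the entries that still contain tag characters."""
    # Skip missing values so that NaN is not reported as text.
    as_text = series.dropna().astype(str)
    return as_text[as_text.str.contains(LEFTOVER_PATTERN, regex=True)]


# Hypothetical sample: one clean word, one partly cleaned cell, one empty.
sample = pd.Series(["ברא", "<heb>ברא", ""])
leftover = rows_with_leftover_markup(sample)
print(leftover.index.tolist())
```

On the real data this could be called as `rows_with_leftover_markup(OHB_EXTENDED_DF["BHSwordPointed"])` after the columns are updated; an empty result suggests the stripping worked as intended.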

##### Call the functions -> update dataframe

In [58]:
# Update the data frame with the cleaned text and clauses. 
words, trailers = clean_text("BHSwordPointed")
OHB_EXTENDED_DF["BHSwordPointed"] = words
OHB_EXTENDED_DF.insert(
    OHB_EXTENDED_DF.columns.get_loc("BHSwordPointed"), 'Trailer', trailers)
OHB_EXTENDED_DF["BHSwordConsonantal"] = clean_text("BHSwordConsonantal")[0]
OHB_EXTENDED_DF["HebrewLexeme"] = clean_lexemes("HebrewLexeme")
OHB_EXTENDED_DF["clauseID"] = clean_clauses("clauseID")
OHB_EXTENDED_DF["lexemeID"] = clean_lex_ids("lexemeID")

##### Visualize cleaned data

In [11]:
# Print the head with the cleaned data.
display(HTML(
    OHB_EXTENDED_DF[
        ["BHSwordPointed", 
        "BHSwordConsonantal", 
        "Trailer",
        "HebrewLexeme", 
        "lexemeID",
        "clauseID"]
    ].head(3).to_html(index=False))
)

BHSwordPointed,BHSwordConsonantal,Trailer,HebrewLexeme,lexemeID,clauseID
בְּ,ב,,בְּ,1,1
רֵאשִׁ֖ית,ראשית,,רֵאשִׁית,2,1
בָּרָ֣א,ברא,,ברא,3,1


### 2. Expand KJV, BHS, and BSB columns

In the original 22 columns there are three features that consist of concatenated values:

- 〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕:〔1｜1｜1｜1〕  
- 〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕:〔1｜1｜1｜1〕
- 〔BSBsort＠BSB〕:〔1＠In〕

I will convert each of them to new dataframes with separate columns for each value, and then merge them back into the original dataframe.
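
The same expansion can also be expressed with pandas string methods. This is a sketch of an alternative, not the approach used in the cells that follow; the second reference row and the short column names are made up for illustration.

```python
import pandas as pd

# Reference column: strip the brackets, split on the full-width bar,
# and convert every part to int.
raw_refs = pd.Series(["〔1｜1｜1｜1〕", "〔2｜1｜1｜2〕"])
refs = raw_refs.str.strip("〔〕").str.split("｜", expand=True).astype(int)
refs.columns = ["vsNode", "book", "chapter", "verse"]

# Gloss column: missing values stay missing, so the node number is
# parsed as a float (which can hold NaN) instead of an int.
raw_gloss = pd.Series(["〔1＠In〕", None])
gloss = raw_gloss.str.strip("〔〕").str.split("＠", n=1, expand=True)
gloss.columns = ["BSBglossNode", "BSBgloss"]
gloss["BSBglossNode"] = pd.to_numeric(gloss["BSBglossNode"])

print(refs)
print(gloss)
```

Passing `n=1` to the gloss split keeps any later `＠` characters inside the gloss text instead of creating extra columns.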

##### Visualize pre-cleaned data

In [12]:
# Data before being cleaned.
display(HTML(
    OHB_EXTENDED_DF[
        ['〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕', 
        '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕',
        '〔BSBsort＠BSB〕']
    ].head(3).to_html(index=False))
)

〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕,〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕,〔BSBsort＠BSB〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔1＠In〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔2＠the beginning〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔4＠created〕


##### Define functions

In [59]:
# Where the value is a list, convert the original 
# column to len(list) new columns. 
updated_col_names = {
    '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕': # cleaned
        ['KVJvsNode', 'KJVbook', 'KJVchapter', 'KJVverse'], 
    '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕': # cleaned
        ['BHSvsNode', 'BHSbook', 'BHSchapter', 'BHSverse'], 
    '〔BSBsort＠BSB〕': # cleaned
        ['BSBglossNode', 'BSBgloss']
}

# ---
# Function that takes a column name from 
# the original df and returns cleaned data.
# Use: convert the KJV ref or BHS ref column to new dataframe. 
def clean_references(col_name):
    # Create a dict for each new name to the new values.
    new_names = updated_col_names[col_name]
    cleaned_data = {name:[] for name in new_names}
    # Iterate over the original dataframe and clean the data. 
    for ref_data in OHB_EXTENDED_DF[col_name]:
        # Remove the outsides of 〔1｜1｜1｜1〕.
        ref_data = ref_data.strip('〕〔')
        # Split 1｜1｜1｜1 and convert each item to an int.
        ref_data = [int(data) for data in ref_data.split('｜')]
        # Add data to the dictionary. 
        cleaned_data[new_names[0]].append(ref_data[0]) # vs node
        cleaned_data[new_names[1]].append(ref_data[1]) # book
        cleaned_data[new_names[2]].append(ref_data[2]) # chapter
        cleaned_data[new_names[3]].append(ref_data[3]) # verse
   
    # Convert the dictionary to a dataframe and return.
    new_df = pd.DataFrame(cleaned_data)
    return new_df

# ---
# Clean the BSB gloss data and store in a new dataframe. 
def clean_gloss(col_name):
    # Create a dict for each new name to the new values.
    new_names = updated_col_names[col_name]
    cleaned_data = {name:[] for name in new_names}
    # Iterate over the original dataframe and clean the data. 
    for gloss_data in OHB_EXTENDED_DF[col_name]:
        # Catch edge cases where gloss_data is NaN.
        if isinstance(gloss_data, str):
            # Remove the outsides of 〔1＠In〕.
            gloss_data = gloss_data.strip('〕〔')
            # Split 1＠In and convert first item to an int.
            gloss_data = gloss_data.split('＠')
            
            # For some reason, gloss node 237839 is split into 
            # decimals .1 and .2, which is why I am using a float. 

            # Add data to the dictionary. 
            gloss_data[0] = float(gloss_data[0])
            cleaned_data[new_names[0]].append(gloss_data[0]) # gloss node
            cleaned_data[new_names[1]].append(gloss_data[1]) # gloss
        else:
            cleaned_data[new_names[0]].append(gloss_data) # gloss node
            cleaned_data[new_names[1]].append(gloss_data) # gloss

    # Convert the dictionary to a dataframe and return.
    new_df = pd.DataFrame(cleaned_data)
    return new_df

##### Call the functions -> new dataframes

In [60]:
# Clean the reference and gloss data and store in three new dataframes. 
KJV_ref_df = clean_references(
    '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕')
BHS_ref_df = clean_references(
    '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕')
BSB_gloss_df = clean_gloss(
    '〔BSBsort＠BSB〕')

##### Visualize cleaned data

In [15]:
# Print the head of the cleaned data.
display(HTML(
    pd.concat(
        [KJV_ref_df, 
        BHS_ref_df, 
        BSB_gloss_df],
        axis=1
    ).head(3).to_html(index=False))
)

KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,BSBglossNode,BSBgloss
1,1,1,1,1,1,1,1,1.0,In
1,1,1,1,1,1,1,1,2.0,the beginning
1,1,1,1,1,1,1,1,4.0,created


### 3. Rename columns and combine dataframes

##### Define functions

In [61]:
# ---
# Drop the three columns that I expanded into new dataframes.
def drop_old_data(dataframe):
    dataframe = dataframe.drop(
        columns=[
        '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕', 
        '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕',
        '〔BSBsort＠BSB〕']
    )
    return dataframe

# ---
# Rename columns and add the three 
# new dataframes for a final df output. 
def combine_data():
    df_copy = copy.deepcopy(OHB_EXTENDED_DF)
    updated_df = pd.DataFrame()
    # Make sure the replaced data gets dropped. 
    if "〔BSBsort＠BSB〕" in df_copy.columns:
        df_copy = drop_old_data(df_copy)
    # Rename the other columns and build updated_df.
    for column in df_copy:
        # If at the column before 〔KJVverseSort..., 
        # add the new reference dataframes.
        if column == "poetryMarker":
            updated_df = pd.concat(
                [updated_df, 
                df_copy[column], 
                KJV_ref_df, 
                BHS_ref_df], 
                axis=1)
        # If at the column before 〔BSBsort..., 
        # add the new BSB dataframe.
        elif column == "extendedGloss":
            updated_df = pd.concat(
                [updated_df, 
                df_copy[column], 
                BSB_gloss_df], 
                axis=1)
        # Otherwise add the current column 
        # from the original dataframe.
        else:
            updated_df[column] = df_copy[column]

    return updated_df

##### Call function -> combined dataframe

In [62]:
# Store the combined data in a new dataframe. 
ohb_extended_cleaned = combine_data()

##### Visualize the final cleaned dataframe

In [18]:
# Display our newly cleaned and labeled data.
display(HTML(ohb_extended_cleaned.head(3).to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,clauseID,clauseKind,clauseType,language,Trailer,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,poneticTranscription,HebrewLexeme,lexemeID,StrongNumber,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,BSBglossNode,BSBgloss
1,¶,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בְּ,ב,bĕ,bᵊ,בְּ,1,,H9003,prep,preposition,in,in,1.0,In
2,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,רֵאשִׁ֖ית,ראשית,rēšît,rēšˌîṯ,רֵאשִׁית,2,H7225,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,2.0,the beginning
3,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בָּרָ֣א,ברא,bārā,bārˈā,ברא,3,H1254,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,4.0,created


## Test OHB_EXTENDED Data Against OHB_DB

I want to compare values in the OHB_EXTENDED data to the OHB_DB data, especially text values, to make sure that the data is usable and accurate before merging features.
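
A row-by-row comparison of two columns can be sketched with a vectorized check. The version below is an illustration under two assumptions: both columns are in the same row order, and two missing values count as equal. The function name `find_mismatches` and the toy values are hypothetical.

```python
import pandas as pd


def find_mismatches(left, right):
    """Return the rows where two equally long columns disagree."""
    # Compare by position, not by whatever index the inputs carry.
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    if len(left) != len(right):
        raise ValueError("columns have different lengths")
    # NaN != NaN in pandas, so treat two missing values as a match.
    both_missing = left.isna() & right.isna()
    differs = left.ne(right) & ~both_missing
    return pd.DataFrame({"left": left[differs], "right": right[differs]})


# Toy verse numbers: rows 1 and 2 disagree.
mismatches = find_mismatches(pd.Series([1, 1, 34]), pd.Series([1, 2, 33]))
print(mismatches)
```

The returned dataframe keeps the row positions as its index, so the disagreeing rows can be looked up directly in either source.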

#### Feature mappings

In [63]:
col_map = {
    'ch_KJV': 'KJVchapter',
    'v_KJV': 'KJVverse',
    'ch_BHS': 'BHSchapter',
    'v_BHS': 'BHSverse',
    'clause_kind': 'clauseKind',
    'clause_typ': 'clauseType',
    'manuscript': 'BHSwordPointed',
    'lex_Hebrew': 'HebrewLexeme',
}

#### Tests

In [64]:
def check_refs():
    mismatches = {}
    for col in ['ch_KJV', 'v_KJV', 'ch_BHS', 'v_BHS']:
        ohb_ref = [r for r in OHB_DB_DF[col]]
        ext_ref = [r for r in ohb_extended_cleaned[col_map[col]]]
        for i, ref in enumerate(ohb_ref):
            if ref != ext_ref[i]:
                mismatches[str(i)+col] = (ref, ext_ref[i])
    return mismatches

def check_clauses():
    mismatches = {}
    for col in ['clause_kind', 'clause_typ']:
        ohb_clause = [r for r in OHB_DB_DF[col]]
        ext_clause = [r for r in ohb_extended_cleaned[col_map[col]]]
        for i, clause in enumerate(ohb_clause):
            if clause != ext_clause[i]:
                mismatches[str(i)+col] = (clause, ext_clause[i])
    return mismatches

def check_text():
    # OHB_DB manuscript value is the text + the trailer.
    text_extensions = [
    '׃', '׃ ׆ ס ', ' ס ', '׃ ׆ ', '׃ ', ' ׀ ',
    ' ', '׃ פ ', ' פ ', '׀ ', '׃ ׆ פ ', '־', '׃ ס ', '׀', 'פ', 'ס', '<', 'Q', 'R', '>', 'q', 'r', '׆'
    ]
    mismatches = {}
    for col in ['manuscript']:
        ohb_text = [w for w in OHB_DB_DF[col]]
        ext_text = [w for w in ohb_extended_cleaned[col_map[col]]]
        for i, w in enumerate(ohb_text):
            w2 = ext_text[i]
            # Characters in the cleaned text that are missing from OHB_DB.
            dif = [ch for ch in w2 if ch not in w]
            # Ignore trailer and markup characters.
            dif_checked = [ch for ch in dif if ch not in text_extensions]
            if len(dif_checked) > 0:
                mismatches[str(i)+col] = (w, w2)
    return mismatches

def check_lex():
    mismatches = {}
    for col in ['lex_Hebrew']:
        ohb_text = [w for w in OHB_DB_DF[col]]
        ext_text = [w for w in ohb_extended_cleaned[col_map[col]]]
        for i, w in enumerate(ohb_text):
            if w != ext_text[i]:
                mismatches[str(i)+col] = (w, ext_text[i])
    return mismatches

#### Test Results

In [21]:
ref_mismatches = check_refs()
message1 = f"Refs Unaligned\n{ref_mismatches}" if len(ref_mismatches) > 0 else "Refs Aligned"
print(message1)

clause_mismatches = check_clauses()
message2 = f"Clauses Unaligned\n{clause_mismatches}" if len(clause_mismatches) > 0 else "Clauses Aligned"
print(message2)

text_mismatches = check_text()
message3 = f"Text Unaligned\n{text_mismatches}" if len(text_mismatches) > 0 else "Text Aligned"
print(message3)

lex_mismatches = check_lex()
message4 = f"Lex Unaligned\n{lex_mismatches}" if len(lex_mismatches) > 0 else "Lex Aligned"
print(message4)


**We have:**

Refs Unaligned

{'152831v_KJV': (1, 2), '152832v_KJV': (1, 2), '152833v_KJV': (1, 2), '152834v_KJV': (1, 2), '152835v_KJV': (1, 2), '152836v_KJV': (1, 2), '152837v_KJV': (1, 2), '152838v_KJV': (1, 2), '191090v_KJV': (34, 33), '191091v_KJV': (34, 33), '191092v_KJV': (34, 33), '191093v_KJV': (34, 33), '191094v_KJV': (34, 33), '191095v_KJV': (34, 33), '191096v_KJV': (34, 33), '191097v_KJV': (34, 33), '191098v_KJV': (34, 33), '191099v_KJV': (34, 33), '191100v_KJV': (34, 33), '191101v_KJV': (34, 33), '191102v_KJV': (34, 33), '191103v_KJV': (34, 33), '191104v_KJV': (34, 33), '191962v_KJV': (3, 2), '191963v_KJV': (3, 2), '191964v_KJV': (3, 2), '191965v_KJV': (3, 2), '191966v_KJV': (3, 2), '191967v_KJV': (3, 2), '194111v_KJV': (21, 22), '194112v_KJV': (21, 22), '194113v_KJV': (21, 22), '194114v_KJV': (21, 22), '194115v_KJV': (21, 22), '194116v_KJV': (21, 22), '194578v_KJV': (44, 43), '194579v_KJV': (44, 43), '194580v_KJV': (44, 43), '194581v_KJV': (44, 43), '194582v_KJV': (44, 43), '194583v_KJV': (44, 43), '194584v_KJV': (44, 43), '194585v_KJV': (44, 43), '194586v_KJV': (44, 43), '194587v_KJV': (44, 43), '194588v_KJV': (44, 43), '194589v_KJV': (44, 43), '194590v_KJV': (44, 43), '194591v_KJV': (44, 43)}

Clauses Aligned

Text Aligned

Lex Aligned

**Next Steps:**
Check references against STEP Bible.

In [22]:
# To visualize which books and chapters the divergent refs occur in. 
if len(ref_mismatches) > 0:
    formatted_refs = {}
    df = ohb_extended_cleaned
    for k in ref_mismatches:
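        # Keys look like '152831v_KJV': the row index followed by the
        # column name. The column names contain no digits, so the
        # digits in the key are the row index.
        node = int(''.join(c for c in k if c.isdigit()))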
        bk = df.iloc[node]['KJVbook']
        ch = df.iloc[node]['KJVchapter']
        vs = df.iloc[node]['KJVverse']
        w = df.iloc[node]['BHSwordPointed']
        formatted_refs[k] = f"{bk}:{ch}:{vs} {w}"

# print(formatted_refs)

**Comparison to STEP Bible**

Nodes 152831-152838:
- ohb_db: 1
- ohb_ext: 2
- STEP: 2 (these words are in vs 1 of BHS)

Nodes 191090-191104:
- ohb_db: 34
- ohb_ext: 33
- STEP: 33 (these words are in vs 34 of BHS)

Nodes 191962-191967:
- ohb_db: 3
- ohb_ext: 2
- STEP: 2 (these words are in vs 3 of BHS)

Nodes 194111-194116:
- ohb_db: 21
- ohb_ext: 22
- STEP: 22 (these words are in vs 21 of BHS)

Nodes 194578-194591:
- ohb_db: 44
- ohb_ext: 43
- STEP: 43 (these words are in vs 43 of BHS)

**Observations:** The OHB_EXTENDED data accurately reflects the KJV. In most divergent cases, the OHB_DB data is reflecting the BHS verse value for a node rather than the KJV verse value. 
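
The node ranges in the comparison can be derived from the mismatch keys by collapsing consecutive row numbers into ranges. The following is a small sketch of that step, assuming keys of the form `'152831v_KJV'` (row number followed by column name); both helper names are hypothetical.

```python
import re


def row_index_from_key(key):
    """Extract the leading row number from a key like '152831v_KJV'."""
    match = re.match(r"\d+", key)
    if match is None:
        raise ValueError(f"key has no leading number: {key!r}")
    return int(match.group())


def group_consecutive(indices):
    """Collapse integers into (start, end) runs of consecutive values."""
    ranges = []
    for i in sorted(set(indices)):
        if ranges and i == ranges[-1][1] + 1:
            # Extend the current run.
            ranges[-1][1] = i
        else:
            # Start a new run.
            ranges.append([i, i])
    return [tuple(r) for r in ranges]


keys = ["152831v_KJV", "152832v_KJV", "152833v_KJV",
        "191090v_KJV", "191091v_KJV"]
ranges = group_consecutive(row_index_from_key(k) for k in keys)
print(ranges)
```

Applied to the full `ref_mismatches` dictionary, this would give one range per divergent passage instead of one entry per word.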


## Save OHB_EXTENDED as CSV file

In [65]:
ohb_cleaned_path = '../data_files/combined/ohb_extended_cleaned.csv'
ohb_extended_cleaned.to_csv(ohb_cleaned_path, index=False)

## Combine OHB_EXTENDED and OHB_DB and BHSA_DB features into new DF

In [66]:
OHB_COMBINED = copy.deepcopy(ohb_extended_cleaned)

# Data from OHB_DB
OHB_COMBINED['lang'] = OHB_DB_DF['lang']
OHB_COMBINED['phrase_typ'] = OHB_DB_DF['phrase_typ']
OHB_COMBINED['phrase_det'] = OHB_DB_DF['phrase_det']
OHB_COMBINED['phrase_function'] = OHB_DB_DF['phrase_function']

In [25]:
# Display our newly combined data.
display(HTML(OHB_COMBINED.head(3).to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,clauseID,clauseKind,clauseType,language,Trailer,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,poneticTranscription,HebrewLexeme,lexemeID,StrongNumber,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,BSBglossNode,BSBgloss,lang,phrase_typ,phrase_det,phrase_function
1,¶,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בְּ,ב,bĕ,bᵊ,בְּ,1,,H9003,prep,preposition,in,in,1.0,In,hbo,Prepositional phrase,undetermined,Time reference
2,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,רֵאשִׁ֖ית,ראשית,rēšît,rēšˌîṯ,רֵאשִׁית,2,H7225,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,2.0,the beginning,hbo,Prepositional phrase,undetermined,Time reference
3,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בָּרָ֣א,ברא,bārā,bārˈā,ברא,3,H1254,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,4.0,created,hbo,Verbal phrase,,Predicate


In [67]:
# Save as new file.
ohb_combined_path = '../data_files/combined/ohb_combined.csv'
OHB_COMBINED.to_csv(ohb_combined_path, index=False)

### Align OHB_EXTENDED with BHSA_DB and Test

See notes [here](https://docs.google.com/document/d/1WE59plLi8EQTaDijHkdgPCVAwOc_TlQU4GvDjyEsPAA/edit?usp=sharing).

Drop node 16563 from the BHSA_DB_DF.

Expand node 392485 into three nodes and save the data in ohb_combined_aligned.csv.

Increment all node values that come after 392485. 

Set marked value 3924860 to 392489 and all after to i + 3.
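
The expand-and-renumber step can be illustrated on a toy table. The sketch below is a simplified version that assumes node ids are contiguous and that renumbering from the first id is acceptable; it does not handle the skipped node, and the function name `expand_row`, the column names, and the toy data are all hypothetical.

```python
import pandas as pd


def expand_row(df, node, copies, id_col="node"):
    """Repeat the row with the given node id and renumber all ids."""
    df = df.reset_index(drop=True)
    # Position of the row to expand.
    pos = df.index[df[id_col] == node][0]
    before = df.iloc[:pos]
    repeated = pd.concat([df.iloc[[pos]]] * copies, ignore_index=True)
    after = df.iloc[pos + 1:]
    out = pd.concat([before, repeated, after], ignore_index=True)
    # Renumber so the ids are contiguous again (assumes they were before).
    start = int(out[id_col].iloc[0])
    out[id_col] = list(range(start, start + len(out)))
    return out


toy = pd.DataFrame({"node": [10, 11, 12], "word": ["a", "b", "c"]})
expanded = expand_row(toy, node=11, copies=3)
print(expanded)
```

After the expansion, the copied rows still carry the original row's values, so each copy needs its own corrected data written in afterwards.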

In [72]:
# Load the aligned file that I manually edited
OHB_ALIGNED_PATH = '../data_files/combined/ohb_combined.csv'
OHB_ALIGNED = pd.read_csv(OHB_ALIGNED_PATH, sep=',', low_memory=False)

In [7]:
# Drop the SKIP row to import data into the new DB with alignment.
SKIP = 16563
BHSA_V2 = BHSA_DB_DF.drop(BHSA_DB_DF.index[SKIP-1])
BHSA_V2.reset_index(drop=True, inplace=True)

In [68]:
# Update OHB_ALIGNED
EXPANDED_NODE = 392485
FIXED_DATA = {
    EXPANDED_NODE: {
        'BHSwordPointed': 'חֲצִ֥י',
        'BHSwordConsonantal': 'חצי',
        'Trailer': ' ',
        'SBLstyleTransliteration': 'ḥăṣî',
        'HebrewLexeme': 'חֲצִי',
        'lexemeID': 2003,
        'extendedStrongNumber': 'H2677',
        'ETCBCgloss': 'half',
        'extendedGloss': 'half',
        'BSBgloss': np.nan,
        'BSBglossNode': np.nan,
    },
    EXPANDED_NODE+1: {
        'BHSwordPointed': 'הַ',
        'BHSwordConsonantal': 'ה',
        'Trailer': '',
        'SBLstyleTransliteration': 'ha',
        'HebrewLexeme': 'הַ',
        'lexemeID': 6,
        'extendedStrongNumber': 'H9009',
        'ETCBCgloss': 'the',
        'extendedGloss': 'the',
        'BSBgloss': np.nan,
        'BSBglossNode': np.nan,
    },
    EXPANDED_NODE+2: {
        'BHSwordPointed': 'מְּנֻחֹֽות',
        'BHSwordConsonantal': 'מנחות',
        'Trailer': ' ׃',
        'SBLstyleTransliteration': 'mĕnuḥôt',
        'HebrewLexeme': 'מְּנֻחֹות',
        'lexemeID': 8720,
        'extendedStrongNumber': 'H4506a',
        'ETCBCgloss': 'Manahathites',
        'extendedGloss': 'Manahathites',
        'BSBgloss': 'half the Manahathites',
        'BSBglossNode': 145601.0,
    },
    392519: {'extendedStrongNumber': 'H4506a'}
}

In [69]:
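# Increment nodes.
def update_nodes(df):
    nodes = [n for n in df['BHSwordSort']]
    k = 0  # running offset applied to every node from here on
    # Update node values
    for i, n in enumerate(nodes):
        if n == SKIP:
            # Leave a gap at SKIP so that later nodes line up with BHSA.
            nodes[i] += 1
            k += 1
        elif n == EXPANDED_NODE+1 and nodes[i-1] != EXPANDED_NODE+1:
            # First original row after the three expanded rows: it still holds
            # EXPANDED_NODE+1, so shift it (and everything after) by two more.
            nodes[i] += k+2
            k += 2
        else:
            nodes[i] += k
    return nodes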

In [70]:
def expand_node():
    df = copy.deepcopy(OHB_ALIGNED)
    i = df.index[df['BHSwordSort'] == EXPANDED_NODE][0]
    pre = df.loc[:i-1]
    mid = df.iloc[[i]]
    mid = pd.concat([mid]*3, ignore_index=True)
    mid['BHSwordSort'] = [EXPANDED_NODE+i for i in range(3)]
    post = df.loc[i+1:]
    df = pd.concat([pre, mid, post], ignore_index=True)
    df['BHSwordSort'] = update_nodes(df)

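    for node, fixes in FIXED_DATA.items():
        # FIXED_DATA keys are offset by one from the aligned numbering
        # (the gap left for SKIP), hence node+1.
        index = df.index[df['BHSwordSort'] == node+1][0]
        for k, value in fixes.items():
            df.at[index, k] = value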

    return df

In [73]:
OHB_ALIGNED = expand_node()

### Tests

In [33]:
col_map = {
    'BHSwordSort': '_id',
    'BHSwordConsonantal': 'g_cons_utf8',
    'BHSwordPointed': 'g_word_utf8',
    'ETCBCgloss': 'gloss',
    'lang': 'languageISO',
    'Trailer': 'trailer_utf8',
    'HebrewLexeme': 'voc_lex_utf8'
}

In [34]:
for c in ['BHSwordConsonantal', 'BHSwordPointed', 'Trailer']:
    OHB_ALIGNED[c] = OHB_ALIGNED[c].replace({np.nan: ""})

def test_aligned():
    mismatches = {k:[] for k in col_map}
    qere_words = [i for i in BHSA_V2['qere_utf8']]  
    qere_trailers = [i for i in BHSA_V2['qere_trailer_utf8']]
    for col in col_map:
        ohb_data = [i for i in OHB_ALIGNED[col]]
        bhs_data = [i for i in BHSA_V2[col_map[col]]]
        for i, d in enumerate(ohb_data):
            # See https://etcbc.github.io/bhsa/features/qere_utf8/
            if col == 'BHSwordPointed':
                w = bhs_data[i] if not qere_words[i] else qere_words[i]
                if d != w:
                    mismatches[col].append((i+1, d, w))
            elif col == 'Trailer':
                t = bhs_data[i] if not qere_trailers[i] else qere_trailers[i]
                if d != t:
                    mismatches[col].append((i+1, d, t))
            # bhsa uses <> rather than [] for values like object marker.
            elif col == 'ETCBCgloss':
                d2 = bhs_data[i]
                dif = [i for i in list(d2) if i not in list(d)]
                dif_checked = [i for i in dif if i not in ['<','>','[',']']]
                if len(dif_checked) > 0:
                    mismatches[col].append((i+1, d, bhs_data[i]))
            elif d != bhs_data[i]:
                mismatches[col].append((i+1, d, bhs_data[i]))
    return mismatches

In [35]:
# Collect mismatch data and save csv files.
mismatches = test_aligned()
path = '../data_files/combined/mismatches/'
def export_mismatches():
    for k in mismatches:
        # the cons vals in BHSA don't have shin/sin differentiation.
        if k != 'BHSwordConsonantal' and len(mismatches[k]) > 0:
            data = {'node':[], 'ohb':[], 'bhsa':[]}
            for v in mismatches[k]:
                n, o, b = v
                data['node'].append(n)
                data['ohb'].append(o)
                data['bhsa'].append(b)
            df = pd.DataFrame(data)
            df.to_csv(f"{path}{k}.csv", index=False)

In [36]:
# Save csv files. 
export_mismatches()

### Mismatch notes

**word.csv:** 1869 mismatches that predominantly consist of g_word_utf8 lacking pointings. It seems best to use the OHB data in this case.

**lex.csv:** 3 mismatches
```
node	ohb	bhsa
152522	חַי	חַיִּים
392488	מְּנֻחֹות	מְנוּחָה
394199	ושׁני֜	וַשְׁנִי
```
**bhsa_gloss.csv:** 490 mismatches -- mostly repeats (e.g., where ohb has cloth and bhsa has clothe). BHSA seems to be more accurate here. 

**trailer.csv:** 150 mismatches

**Note:** the BHSA has [features](https://etcbc.github.io/bhsa/features/qere_utf8/) qere_utf8 and qere_trailer_utf8 that provide vocalized data when it is lacking in the *ketiv* form. I've updated the mismatch code to choose those vocalized options in the BHSA data when the ketiv form is missing pointings.
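The fallback rule can be sketched on its own as follows. This is a simplified illustration rather than the notebook's `test_aligned`: the helper names and the placeholder strings are hypothetical, and it assumes a missing qere arrives as `None` or an empty string (a pandas `NaN` would need an explicit `pd.isna` check instead).

```python
def effective_form(ketiv, qere):
    """Return the qere value when one is recorded, otherwise the ketiv value."""
    # None and "" both count as "no qere recorded".
    return qere if qere else ketiv

def find_mismatches(ohb_words, bhsa_words, bhsa_qere):
    """Collect (node, ohb, bhsa) triples where the two sources disagree."""
    mismatches = []
    for i, (ohb, ketiv, qere) in enumerate(zip(ohb_words, bhsa_words, bhsa_qere)):
        expected = effective_form(ketiv, qere)
        if ohb != expected:
            # Nodes are 1-based, so row i corresponds to node i + 1.
            mismatches.append((i + 1, ohb, expected))
    return mismatches

# Placeholder strings stand in for Hebrew word forms.
result = find_mismatches(
    ohb_words=["x", "y", "z"],
    bhsa_words=["x", "k", "z2"],
    bhsa_qere=[None, "y", ""],
)
print(result)
```

Row 2 matches only because the qere value is preferred over the ketiv; row 3 has no qere, so the ketiv is compared and reported.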

**UPDATED MISMATCHES**

**word.csv:** 2 differences
```
node	ohb	bhsa
199283		ה
205832	שֻׁ֝֩בו 	שֻׁ֝֠בוּ
```
**trailer.csv:** 6 differences, bhsa following qere. 
```
node	ohb	bhsa
137795	''	 ' ' 
156164	''	־
227810	''	 ' '
345548	''	 ' '
363613	''	 ' '
364988	''	 ' '
```

**CONCLUSIONS**

Most of these are insignificant. It is likely best to go with the BHSA gloss column rather than the OHB gloss column. 

In [37]:
# View specified rows in OHB and BHSA
i = 335242-3
display(HTML(OHB_ALIGNED.loc[i:i+5].to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,clauseID,clauseKind,clauseType,language,Trailer,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,poneticTranscription,HebrewLexeme,lexemeID,StrongNumber,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,BSBglossNode,BSBgloss,lang,phrase_typ,phrase_det,phrase_function
335241,,,16322,19,145,1,17597,19,145,1,71573,Nominal clauses,Nominal clause,Hebrew,,דָ֫וִ֥ד,דוד,dāwid,ḏˈāwˌiḏ,דָּוִד,4258,H1732,H1732,nmpr.m.sg.a,"proper noun, masculine, singular, absolute",David,David,209386.0,Of David.,hbo,Nominal phrase,undetermined,Predicate complement
335242,¶,‡,16322,19,145,1,17597,19,145,1,71574,Verbal clauses,Zero-yiqtol-null clause,Hebrew,,אֲרֹומִמְךָ֣,ארוממך,ʾărômimĕkā,ʔᵃrômimᵊḵˈā,רום,413,H7311,H7311,verb.piel.impf.p1.u.sg.prs.p2.m.sg,"verb, pi“el, imperfect, first person, unknown, singular, pronominal suffix, second person, masculine, singular",be high,[I]+ be high +[you],209387.0,"I will exalt You,",hbo,Verbal phrase,,Predicate with object suffix
335243,,,16322,19,145,1,17597,19,145,1,71575,Clauses without predication,Vocative clause,Hebrew,,אֱלֹוהַ֣י,אלוהי,ʾĕlôhay,ʔᵉlôhˈay,אֱלֹהִים,4,H433,H433,subs.m.pl.a,"noun, masculine, plural, absolute",god(s),god [pl.],209388.0,my God,hbo,Nominal phrase,determined,Vocative
335244,,,16322,19,145,1,17597,19,145,1,71575,Clauses without predication,Vocative clause,Hebrew,,הַ,ה,ha,ha,הַ,6,,H9009,art,article,the,the,,,hbo,Nominal phrase,determined,Vocative
335245,,,16322,19,145,1,17597,19,145,1,71575,Clauses without predication,Vocative clause,Hebrew,,מֶּ֑לֶךְ,מלך,melek,mmˈeleḵ,מֶלֶךְ,671,H4428,H4428,subs.m.sg.a,"noun, masculine, singular, absolute",king,king,209389.0,[and] King;,hbo,Nominal phrase,determined,Vocative
335246,,,16322,19,145,1,17597,19,145,1,71576,Verbal clauses,We-yiqtol-null clause,Hebrew,,וַ,ו,wa,wa,וְ,8,,H9000,conj,conjunction,and,and,,,hbo,Conjunctive phrase,,Conjunction


In [38]:
display(HTML(BHSA_DB_DF.loc[i:i+10].to_html(index=False)))

_id,freq_lex,freq_occ,g_cons,g_cons_utf8,g_lex,g_lex_utf8,g_nme,g_nme_utf8,g_pfm,g_pfm_utf8,g_prs,g_prs_utf8,g_uvf,g_uvf_utf8,g_vbe,g_vbe_utf8,g_vbs,g_vbs_utf8,g_word,g_word_utf8,gloss,gn,kq_hybrid,kq_hybrid_utf8,language,languageISO,lex,lex0,lex_utf8,lexeme_count,ls,nametype,nme,nu,number,pdp,pfm,prs,prs_gn,prs_nu,prs_ps,ps,qere,qere_trailer,qere_trailer_utf8,qere_utf8,rank_lex,rank_occ,sp,st,suffix_gender,suffix_number,suffix_person,trailer,trailer_utf8,uvf,vbe,vbs,voc_lex,voc_lex_utf8,vs,vt
335240,20069,15641,L,ל,L:-,לְ,,,,,,,,,,,,,L:-,לְ,to,,,,Hebrew,hbo,L,L,ל,0,none,,,,24591,prep,,absent,unknown,unknown,unknown,,,,,,2,2,prep,,unknown,unknown,unknown,,,absent,,,L:,לְ,,
335241,1075,800,DWD,דוד,D@WID,דָוִד,/,֜,,,,,,,,,,,D@60WI71D,דָ֫וִ֥ד,David,m,,,Hebrew,hbo,DWD==/,DWD,דוד,0,none,pers,,sg,24592,nmpr,,,,,,,,,,,41,42,nmpr,a,,,,,,absent,,,D.@WID,דָּוִד,,
335242,188,4,>RWMMK,ארוממך,ROWMIM:,רֹומִםְ,,,!>:A!,אֲ,+K@,כָ,,,[,,,,>:AROWMIM:K@74,אֲרֹומִמְךָ֣,be high,unknown,,,Hebrew,hbo,RWM[,RWM,רום,0,none,,absent,sg,24593,verb,>,K,m,sg,p2,p1,,,,,251,6060,verb,,m,sg,p2,,,absent,,absent,RWM,רום,piel,impf
335243,2601,3,>LWHJ,אלוהי,>:ELOWH,אֱלֹוה,/AJ,ַ֜י,,,+,,,,,,,,>:ELOWHA74J,אֱלֹוהַ֣י,god(s),m,,,Hebrew,hbo,>LHJM/,>LHJM,אלהים,0,none,,J,pl,24594,subs,,J,unknown,sg,p1,,,,,,18,7211,subs,a,unknown,sg,p1,,,absent,,,>:ELOHIJM,אֱלֹהִים,,
335244,30386,24664,H,ה,HA-,הַ,,,,,,,,,,,,,HA-,הַ,the,,,,Hebrew,hbo,H,H,ה,0,none,,,,24595,art,,,,,,,,,,,1,1,art,,,,,,,absent,,,HA,הַ,,
335245,2523,2325,MLK,מלך,M.ELEK:,מֶּלֶךְ,/,֜,,,,,,,,,,,M.E92LEK:,מֶּ֑לֶךְ,king,m,,,Hebrew,hbo,MLK/,MLK,מלך,0,none,,,sg,24596,subs,,absent,unknown,unknown,unknown,,,,,,20,16,subs,a,unknown,unknown,unknown,,,absent,,,MELEK:,מֶלֶךְ,,
335246,50272,50238,W,ו,WA-,וַ,,,,,,,,,,,,,WA-,וַ,and,,,,Hebrew,hbo,W,W,ו,0,none,,,,24597,conj,,,,,,,,,,,0,0,conj,,,,,,,absent,,,W:,וְ,,
335247,327,3,>BRKH,אברכה,B@R:AK,בָרֲך,,,!>:A!,אֲ,,,,,[@H,ָה,,,>:AB@R:AK@71H,אֲבָרֲכָ֥ה,bless,unknown,,,Hebrew,hbo,BRK[,BRK,ברך,0,none,,absent,sg,24598,verb,>,absent,unknown,unknown,unknown,p1,,,,,152,7211,verb,,unknown,unknown,unknown,,,absent,H=,absent,BRK,ברך,piel,impf
335248,864,115,CMK,שׁמך,CIM:,שִׁםְ,/,֜,,,+K@,כָ,,,,,,,11CIM:K@81,שִׁ֝מְךָ֗,name,m,,,Hebrew,hbo,CM/,CM,שׁם,0,none,,,sg,24599,subs,,K,m,sg,p2,,,,,,55,367,subs,a,m,sg,p2,,,absent,,,C;M,שֵׁם,,
335249,20069,15641,L,ל,L:-,לְ,,,,,,,,,,,,,L:-,לְ,to,,,,Hebrew,hbo,L,L,ל,0,none,,,,24600,prep,,absent,unknown,unknown,unknown,,,,,,2,2,prep,,unknown,unknown,unknown,,,absent,,,L:,לְ,,


In [74]:
# SAVE DATA
final_path = '../data_files/combined/ohb_combined_aligned.csv'
OHB_ALIGNED.to_csv(final_path, index=False)

### Test Strong's Numbers

In [39]:
def check_strongs():
    strongs = [s for s in OHB_ALIGNED['strongs']]
    lexemes = [l for l in OHB_ALIGNED['lex']]
    lex_ids = [i for i in OHB_ALIGNED['lexId']]
    lex_mapped = []
    for i, id in enumerate(lex_ids):
        lex_mapped.append(f"{id} {lexemes[i]}")
    # Each key is a lex_id paired with its lex word value.
    lexIDs_with_mismatches = {id:{} for id in set(lex_mapped)}
    # Keep track of the lex_ids we've already checked.
    visited = set()
    # Iterate over all nodes. 
    for i, cur_sn in enumerate(strongs):
        cur_id = lex_ids[i]
        key = lex_mapped[i]
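        # Create the entry once; reassigning it for every row would throw away
        # the nodes collected when this lex_id was first scanned.
        lexIDs_with_mismatches[key].setdefault(cur_sn, [])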
        # If we haven't visited the current lex_id, compare it to the rest of the nodes.
        if cur_id not in visited:
            for j, new_id in enumerate(lex_ids[i:]):
                j += i
                new_sn = strongs[j]
                if new_id == cur_id:
                    # If we've reached the same lex, check its strong number
                    # against the current strong number and add the new sn as
                    # a key mapped to a list of nodes that this sn occurs. 
                    if cur_sn != new_sn:
                        if new_sn not in lexIDs_with_mismatches[key]:
                            lexIDs_with_mismatches[key][new_sn] = [j+1]
                        else:
                            lexIDs_with_mismatches[key][new_sn].append(j+1)
                    else:
                        lexIDs_with_mismatches[key][cur_sn].append(j+1)
        visited.add(cur_id)
        if i % 5000 == 0 and i > 1:
            print(i)
    return {k:v for k, v in lexIDs_with_mismatches.items() if len(v) > 1}

In [40]:
# strongs_data = check_strongs()

In [41]:
# Print the mismatch data.
def display_sn_mismatches():
    sorted_data = dict(sorted(strongs_data.items(), key=lambda t: int(t[0].split(' ')[0])))
    nodes_sn = [(k, [v for v in sorted_data[k]]) for k in sorted_data]
    for nsn in nodes_sn:
        lex_id, sns = nsn
        print(f"Lex_id {lex_id}: {sns}")

#### Notes
In OHB_EXTENDED, 876 of the lexemes occur with more than one Strong's number. Most of these appear with 2-3 other Strong's numbers, but a few have many more, e.g.:

Lex_id 363 שַׁ: ['H1571', 'H6965', 'H7945', 'H859', 'H5921', 'H1121', 'H2266', 'H8033', 'H2603', 'H1961', 'H3808', 'H6927', 'H3381', 'H5975', 'H5221', 'H8216', 'H270', 'H369', 'H410', 'H4428', 'H2654', 'H6315', 'H157', 'H8010', 'H5849', 'H1570', 'H7218', 'H2470', 'H3602', 'H1696', 'H5998', 'H5158', 'H559', 'H6213', 'H935', 'H3426', 'H4745', 'H3528', 'H1931', 'H398', 'H2896', 'H3372', 'H5307', 'H5087', 'H3117', 'H4191', 'H3318', 'H2111', 'H6960', 'H1992', 'H8074', 'H5414', 'H5973', 'H3754']

Sometimes a word is assigned the Strong's number of its suffix rather than that of its own lexeme. Consider the following example:

Lex_id 1 בְּ: ['H9003', 'H2004', 'H5221', 'H2657', 'H5674', 'H8055']

H9003 (in/on/with) appears as בָּהֵ֖ן at node 9058 and gets assigned H2004 הֵן (they, fem.) because of the suffix. In this case Logos renders just בְּ and STEP has both בְּ as H9003 and הֶן as H9039 (Op3f, them).

Other assignments are simply wrong. For example, at node 335242 in OHB_EXTENDED (335243 after alignment; Ps 145:1), lex 4 (אֱלֹהִים) is assigned H433 (false God), where STEP rightly keeps it as H430.
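The same kind of report can be produced in a single pass with a pandas `groupby`, which avoids rescanning the frame for every lexeme. This is a sketch under assumptions, not the check used above: the helper name is hypothetical, the default column names are taken from the OHB_ALIGNED header shown earlier, the toy rows are made up, and rows with a missing lexeme id or Strong's number are silently skipped because `groupby` drops missing keys by default.

```python
import pandas as pd

def lexemes_with_multiple_strongs(df, lex_col="lexemeID", sn_col="extendedStrongNumber"):
    """Map each lexeme id to {Strong's number: [1-based nodes]} if it has more than one."""
    # 1-based node numbers, aligned with the rows of df.
    nodes = pd.Series(range(1, len(df) + 1), index=df.index)
    result = {}
    for (lex_id, sn), group in nodes.groupby([df[lex_col], df[sn_col]]):
        result.setdefault(lex_id, {})[sn] = group.tolist()
    # Keep only lexemes that occur with more than one Strong's number.
    return {k: v for k, v in result.items() if len(v) > 1}

toy = pd.DataFrame({
    "lexemeID": [1, 1, 2, 1, 2],
    "extendedStrongNumber": ["H9003", "H2004", "H430", "H9003", "H430"],
})
report = lexemes_with_multiple_strongs(toy)
print(report)
```

Lexeme 1 is reported because it carries two different numbers; lexeme 2 is consistent and is left out.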

## Keep Features / Data Structure

Feature: Origin (origin column name)

WORD TABLE
- BHSA-aligned id: BHSA_DB (_id), excluding/skipping node 16563
- Consonantal text: OHB_EXTENDED (word_cons), unpointed sin/shin as in Sefaria & elsewhere
- Pointed text: OHB_EXTENDED (word)
- Trailer: OHB_EXTENDED (trailer)
- Lexeme id (FK): OHB_EXTENDED (lex_id)
- Gloss (BSB or LEB): OHB_EXTENDED
- Part of Speech: BHSA_DB (sp)
- Person: BHSA_DB (ps)
- Number: BHSA_DB (nu)
- Gender: BHSA_DB (gn)
- Verb Tense: BHSA_DB (vt)
- Verb Stem: BHSA_DB (vs)
- State: BHSA_DB (st)
- Pronoun suffix number: BHSA_DB (prs_nu)
- Pronoun suffix gender: BHSA_DB (prs_gn)
- Pronoun suffix person: BHSA_DB (prs_ps)
- **SUFFIX** BHSA_DB (g_prs_utf8) -- not quite right, but can be edited.
- BHSA-phrase id (FK): TF API
- BHSA-clause id (FK): TF API
- BHSA_clause_atom id (FK): TF API
- BHSA-sentence id: BH4C_DB (sentence_node_id)
- BHS b:ch:v: OHB_EXTENDED
- KJV b:ch:v: OHB_EXTENDED
- Freq occurrence: BHSA_DB (freq_occ)
- Rank occurrence: BHSA_DB (rank_occ)

LEXEME TABLE
- Lexeme id: OHB_EXTENDED (lex_id)
- Lexeme: OHB_EXTENDED (lex)
- Freq lex: BHSA_DB (freq_lex)
- Rank lex: BHSA_DB (rank_lex)
- Name type: BHSA_DB (nametype), used for proper nouns, w/ caution
- Strong's (FK): OHB_EXTENDED (strongs)
- Gloss (BHSA): OHB_EXTENDED 
- Gloss (STEP): 

PHRASE TABLE
- Phrase id: BHSA_DB (_id)
- Determined: BHSA_DB (det)
- Function: BHSA_DB (function)
- Number: BHSA_DB (number)
- Type: BHSA_DB (typ)

CLAUSE TABLE
- Clause id: BHSA_DB (_id)
- Domain: BHSA_DB (domain)
- Kind: BHSA_DB (kind)
- Number: BHSA_DB (number), pos in sentence
- Relation: BHSA_DB (rela)
- Type: BHSA_DB (typ)

CLAUSE ATOM TABLE
- Clause atom id: BHSA_DB (_id)
- Code: BHSA_DB (code)
- Paragraph: BHSA_DB (pargr)
- Tab: BHSA_DB (tab)
- Type: BHSA_DB (typ)

BOOK TABLE
- Book id: BHSA_DB (_id)
- OSIS abbrev: BHSA_DB (OSIS)
- LEB abbrev: BHSA_DB (LEB)
- Name: Self
- Tanakh ordering: Self
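To make the foreign-key (FK) relationships in this outline concrete, here is a minimal sketch with only two of the tables. It is an illustration, not the schema this notebook produces: the columns are trimmed to a handful, the row values are placeholders, and the explicit `REFERENCES` clause is an assumption, since `DataFrame.to_sql` creates plain tables without key constraints.

```python
import sqlite3

SCHEMA = """
CREATE TABLE lex (
    lexId   INTEGER PRIMARY KEY,
    lexText TEXT,
    gloss   TEXT
);
CREATE TABLE word (
    wordId INTEGER PRIMARY KEY,
    text   TEXT,
    lexId  INTEGER REFERENCES lex(lexId)
);
"""

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
con.executescript(SCHEMA)

# Placeholder rows: two words sharing one lexeme.
con.execute("INSERT INTO lex VALUES (?, ?, ?)", (1, "lexeme-1", "in"))
con.executemany("INSERT INTO word VALUES (?, ?, ?)",
                [(1, "word-1", 1), (2, "word-2", 1)])

# Resolve each word's gloss through the lexeme table.
rows = con.execute(
    "SELECT word.wordId, word.text, lex.gloss "
    "FROM word JOIN lex USING (lexId) ORDER BY word.wordId"
).fetchall()
print(rows)

# A word pointing at a lexeme that does not exist is rejected.
try:
    con.execute("INSERT INTO word VALUES (?, ?, ?)", (3, "word-3", 99))
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

The phrase, clause, clause atom and sentence ids on the word table would follow the same pattern as `lexId`.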

# Add Data from BHSA_DB and TF API

In [2]:
A = use('bhsa', hoist=globals(), checkout='local', version='c')

This is Text-Fabric 9.1.1
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

120 features found and 0 ignored


In [75]:
# Load the aligned file that was cleaned
OHB_ALIGNED_PATH = '../data_files/combined/ohb_combined_aligned.csv'
OHB_ALIGNED = pd.read_csv(OHB_ALIGNED_PATH, sep=',', low_memory=False)

In [76]:
SKIP = 16563

words = [n for n in F.otype.s('word') if n != SKIP]
LEX_IDS = [L.u(i, otype='lex')[0] for i in words]
PHRASE_IDS = [L.u(i, otype='phrase')[0] for i in words]
CLAUSE_ATOM_IDS = [L.u(i, otype='clause_atom')[0] for i in words]
CLAUSE_IDS = [L.u(i, otype='clause')[0] for i in words]
SENTENCE_IDS = [L.u(i, otype='sentence')[0] for i in words]

In [77]:
# Test data
data_map = {
    'lex': LEX_IDS,
    'phrase': PHRASE_IDS,
    'clause_atom': CLAUSE_ATOM_IDS,
    'clause': CLAUSE_IDS,
    'sentence': SENTENCE_IDS
}
ohb_len = len(OHB_ALIGNED.index)
for k in data_map:
    data = data_map[k]
    data_len = len(data)
    start_node = data[0]
    end_node = data[-1]
    start_tf = F.otype.s(k)[0]
    end_tf = F.otype.s(k)[-1]
    if data_len == ohb_len \
    and start_node == start_tf \
    and end_node == end_tf:
        print(f"{k} data looks good!")
    else:
        # 'lex' will fail because of the last node, but it IS accurate.
        print(f"{k} data doesn't look right...")
        print(f"Len: {data_len}-{ohb_len} Start: {start_node}-{start_tf} End: {end_node}-{end_tf}")

lex data doesn't look right...
Len: 426583-426583 Start: 1437567-1437567 End: 1437689-1446799
phrase data looks good!
clause_atom data looks good!
clause data looks good!
sentence data looks good!
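The `lex` check fails only on the end node: the last word of the text maps to lexeme 1437689, while the last lexeme node is 1446799. Comparing end nodes suits phrases, clauses and sentences, where the final word belongs to the final node, but lexemes are shared across the whole text. A range check is one possible alternative for that case. The helper below is a hypothetical sketch with made-up ids, not part of the notebook.

```python
def check_ids(ids, all_nodes, expected_len, ordered_ends=True):
    """Sanity-check a per-word list of container node ids.

    ids          -- one node id per word, in text order
    all_nodes    -- every node of that type, in ascending order
    expected_len -- number of word rows the ids must line up with
    ordered_ends -- if True, the first and last ids must equal the first and
                    last nodes; if False, ids only need to fall in that range
    """
    if len(ids) != expected_len:
        return False
    first, last = all_nodes[0], all_nodes[-1]
    if ordered_ends:
        return ids[0] == first and ids[-1] == last
    return all(first <= n <= last for n in ids)

# Made-up ids: four words, three phrases, three lexemes.
phrase_ids = [10, 10, 11, 12]
lex_ids = [100, 101, 100, 101]

phrase_ok = check_ids(phrase_ids, [10, 11, 12], 4)
lex_strict = check_ids(lex_ids, [100, 101, 102], 4)
lex_ok = check_ids(lex_ids, [100, 101, 102], 4, ordered_ends=False)
print(phrase_ok, lex_strict, lex_ok)
```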


In [78]:
# Drop the SKIP row to import data into the new DB with alignment.
display(HTML(BHSA_DB_DF.loc[SKIP-2:SKIP].to_html(index=False)))
BHSA_V2 = BHSA_DB_DF.drop(BHSA_DB_DF.index[SKIP-1])
BHSA_V2.reset_index(drop=True, inplace=True)
display(HTML(BHSA_V2.loc[SKIP-2:SKIP].to_html(index=False)))

_id,freq_lex,freq_occ,g_cons,g_cons_utf8,g_lex,g_lex_utf8,g_nme,g_nme_utf8,g_pfm,g_pfm_utf8,g_prs,g_prs_utf8,g_uvf,g_uvf_utf8,g_vbe,g_vbe_utf8,g_vbs,g_vbs_utf8,g_word,g_word_utf8,gloss,gn,kq_hybrid,kq_hybrid_utf8,language,languageISO,lex,lex0,lex_utf8,lexeme_count,ls,nametype,nme,nu,number,pdp,pfm,prs,prs_gn,prs_nu,prs_ps,ps,qere,qere_trailer,qere_trailer_utf8,qere_utf8,rank_lex,rank_occ,sp,st,suffix_gender,suffix_number,suffix_person,trailer,trailer_utf8,uvf,vbe,vbs,voc_lex,voc_lex_utf8,vs,vt
16562,15542,14194,B,ב,B.A-,בַּ,,,,,,,,,,,,,B.A-,בַּ,in,,,,Hebrew,hbo,B,B,ב,0,none,,,,16562,prep,,absent,unknown,unknown,unknown,,,,,,3,3,prep,,unknown,unknown,unknown,,,absent,,,B.:,בְּ,,
16563,30386,6487,,,-,,,,,,,,,,,,,,-,,the,,,,Hebrew,hbo,H,H,ה,0,none,,,,16563,art,,,,,,,,,,,1,6,art,,,,,,,absent,,,HA,הַ,,
16564,65,35,XLWM,חלום,X:ALOWM,חֲלֹום,/,֜,,,,,,,,,,,X:ALO73WM,חֲלֹ֖ום,dream,m,,,Hebrew,hbo,XLWM/,XLWM,חלום,0,none,,,sg,16564,subs,,absent,unknown,unknown,unknown,,,,,,586,1099,subs,a,unknown,unknown,unknown,,,absent,,,X:ALOWM,חֲלֹום,,


_id,freq_lex,freq_occ,g_cons,g_cons_utf8,g_lex,g_lex_utf8,g_nme,g_nme_utf8,g_pfm,g_pfm_utf8,g_prs,g_prs_utf8,g_uvf,g_uvf_utf8,g_vbe,g_vbe_utf8,g_vbs,g_vbs_utf8,g_word,g_word_utf8,gloss,gn,kq_hybrid,kq_hybrid_utf8,language,languageISO,lex,lex0,lex_utf8,lexeme_count,ls,nametype,nme,nu,number,pdp,pfm,prs,prs_gn,prs_nu,prs_ps,ps,qere,qere_trailer,qere_trailer_utf8,qere_utf8,rank_lex,rank_occ,sp,st,suffix_gender,suffix_number,suffix_person,trailer,trailer_utf8,uvf,vbe,vbs,voc_lex,voc_lex_utf8,vs,vt
16562,15542,14194,B,ב,B.A-,בַּ,,,,,,,,,,,,,B.A-,בַּ,in,,,,Hebrew,hbo,B,B,ב,0,none,,,,16562,prep,,absent,unknown,unknown,unknown,,,,,,3,3,prep,,unknown,unknown,unknown,,,absent,,,B.:,בְּ,,
16564,65,35,XLWM,חלום,X:ALOWM,חֲלֹום,/,֜,,,,,,,,,,,X:ALO73WM,חֲלֹ֖ום,dream,m,,,Hebrew,hbo,XLWM/,XLWM,חלום,0,none,,,sg,16564,subs,,absent,unknown,unknown,unknown,,,,,,586,1099,subs,a,unknown,unknown,unknown,,,absent,,,X:ALOWM,חֲלֹום,,
16565,349,345,J<QB,יעקב,JA<:AQOB,יַעֲקֹב,/,֜,,,,,,,,,,,JA95<:AQO92B,יַֽעֲקֹ֑ב,Jacob,m,,,Hebrew,hbo,J<QB/,J<QB,יעקב,0,none,pers,,sg,16565,nmpr,,,,,,,,,,,143,109,nmpr,a,,,,,,absent,,,JA<:AQOB,יַעֲקֹב,,


In [79]:
# Load in tables from the BHSA DB
LEX_DF = pd.read_sql_query("SELECT * FROM lex", BHSA_DB_CON)
PHRASE_DF = pd.read_sql_query("SELECT * FROM phrase", BHSA_DB_CON)
CLAUSE_DF = pd.read_sql_query("SELECT * FROM clause", BHSA_DB_CON)
CLAUSE_ATOM_DF = pd.read_sql_query("SELECT * FROM clause_atom", BHSA_DB_CON)

In [81]:
WORD_COLS = {
    'wordId': OHB_ALIGNED['BHSwordSort'],
    'book': OHB_ALIGNED['KJVbook'],
    'chKJV': OHB_ALIGNED['KJVchapter'],
    'vsKJV': OHB_ALIGNED['KJVverse'],
    'vsIdKJV': OHB_ALIGNED['KVJvsNode'],
    'chBHS': OHB_ALIGNED['BHSchapter'],
    'vsBHS': OHB_ALIGNED['BHSverse'],
    'vsIdBHS': OHB_ALIGNED['BHSvsNode'],
    # 'lang': OHB_ALIGNED['lang'], Get this from Lex
    # 'speech': BHSA_V2['sp'], Get this from Lex
    'person': BHSA_V2['ps'],
    'gender': BHSA_V2['gn'],
    'number': BHSA_V2['nu'],
    'vTense': BHSA_V2['vt'],
    'vStem': BHSA_V2['vs'],
    'state': BHSA_V2['st'],
    'prsPerson': BHSA_V2['prs_ps'],
    'prsGender': BHSA_V2['prs_gn'],
    'prsNumber': BHSA_V2['prs_nu'],
    'suffix': BHSA_V2['g_prs_utf8'],
    'text': OHB_ALIGNED['BHSwordPointed'],
    'textCons': OHB_ALIGNED['BHSwordConsonantal'],
    'trailer': OHB_ALIGNED['Trailer'],
    'transliteration': OHB_ALIGNED['SBLstyleTransliteration'],
    'glossExt': OHB_ALIGNED['extendedGloss'],
    'glossBSB': OHB_ALIGNED['BSBgloss'],
    'sortBSB': OHB_ALIGNED['BSBglossNode'],
    'strongsId': OHB_ALIGNED['extendedStrongNumber'],
    'lexId': LEX_IDS,
    'phraseId': PHRASE_IDS,
    'clauseAtomId': CLAUSE_ATOM_IDS,
    'clauseId': CLAUSE_IDS,
    'sentenceId': SENTENCE_IDS,
    'freqOcc': BHSA_V2['freq_occ'],
    'rankOcc': BHSA_V2['rank_occ'],
    'poetryMarker': OHB_ALIGNED['poetryMarker'],
    'parMarker': OHB_ALIGNED['paragraphMarker'],
}

LEX_COLS = {
    'lexId': LEX_DF['_id'],
    'language': [{'Hebrew':'hbo','Aramaic':'arc'}[k] for k in LEX_DF['language']],
    'speech': LEX_DF['sp'],
    'nameType': LEX_DF['nametype'],
    'lexSet': LEX_DF['ls'],
    'lexText': LEX_DF['voc_lex_utf8'],
    'gloss': LEX_DF['gloss'],
    'freqLex': LEX_DF['freq_lex'],
    'rankLex': LEX_DF['rank_lex'],
    # 'strongs': ...
    # 'gloss_STEP': ...
}

PHRASE_COLS = {
    'phraseId': PHRASE_DF['_id'],
    'determined': PHRASE_DF['det'],
    'function': PHRASE_DF['function'],
    'phraseNumber': PHRASE_DF['number'], # position in phrase
    'phraseType': PHRASE_DF['typ'],
}

CLAUSE_COLS = {
    'clauseId': CLAUSE_DF['_id'],
    'domain': CLAUSE_DF['domain'],
    'kind': CLAUSE_DF['kind'],
    'clauseNumber': CLAUSE_DF['number'], # position in sentence
    'relation': CLAUSE_DF['rela'],
    'clauseType': CLAUSE_DF['typ']
}

CLAUSE_ATOM_COLS = {
    'clauseAtomId': CLAUSE_ATOM_DF['_id'],
    'code': CLAUSE_ATOM_DF['code'],
    'paragraph': CLAUSE_ATOM_DF['pargr'],
    'tab': CLAUSE_ATOM_DF['tab'],
    'clauseAtomType': CLAUSE_ATOM_DF['typ'],
}

STRONGS_COLS = {
    'strongsId': TBESH_DF['strongs'],
    'lexeme': TBESH_DF['lex'],
    'transliterationSTEP': TBESH_DF['transliteration'],
    'morphCode': TBESH_DF['morph'],
    'glossSTEP': TBESH_DF['gloss'],
    'definition': TBESH_DF['definition']
}

In [82]:
book_path = '../data_files/books.csv'
lex_sentences_path = '../data_files/lex_sentences.csv'
passages_path = '../data_files/passages.csv'
WORD_TABLE = pd.DataFrame(WORD_COLS, index=None)
LEX_TABLE = pd.DataFrame(LEX_COLS, index=None)
PHRASE_TABLE = pd.DataFrame(PHRASE_COLS, index=None)
CLAUSE_TABLE = pd.DataFrame(CLAUSE_COLS, index=None)
CLAUSE_ATOM_TABLE = pd.DataFrame(CLAUSE_ATOM_COLS, index=None)
LEX_SENTENCE_TABLE = pd.read_csv(lex_sentences_path, sep='\t')
STRONGS_TABLE = pd.DataFrame(STRONGS_COLS, index=None)
PASSAGE_TABLE = pd.read_csv(passages_path, sep='\t') # Generated in tf_passage_weights_v3
BOOK_TABLE = pd.read_csv(book_path, sep=',', low_memory=False)

In [None]:
# display(HTML(WORD_TABLE.tail().to_html(index=False)))
display(HTML(WORD_TABLE.loc[SKIP-3:SKIP+2].to_html(index=False)))

wordId,book,chKJV,vsKJV,vsIdKJV,chBHS,vsBHS,vsIdBHS,person,gender,number,vTense,vStem,state,prsPerson,prsGender,prsNumber,suffix,text,textCons,trailer,transliteration,glossExt,glossBSB,sortBSB,strongsId,lexId,phraseId,clauseAtomId,clauseId,sentenceId,freqOcc,rankOcc,poetryMarker,parMarker
16561.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,,m,pl,,,a,unknown,unknown,unknown,,אֱלֹהִ֛ים,אלהים,,ʾĕlōhîm,god [pl.],of God,11455.0,H430,1437570,661699,519117,430891,1174875,1177,31,,
16562.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,,,,,,,unknown,unknown,unknown,,בַּ,ב,,ba,in,In,11452.0,H9003,1437567,661700,519117,430891,1174875,14194,3,,
16563.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,,m,sg,,,a,unknown,unknown,unknown,,חֲלֹ֖ום,חלום,,ḥălôm,dream,that dream,11453.0,H2472,1438481,661700,519117,430891,1174875,35,1099,,
16564.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,,m,sg,,,a,,,,,יַֽעֲקֹ֑ב,יעקב,,yaʿăqōb,Jacob,‘Jacob!’,11458.0,H3290,1438668,661701,519118,430892,1174876,345,109,,
16565.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,,,,,,,,,,,וָ,ו,,wā,and,And,11459.0,H9000,1437574,661702,519119,430893,1174877,50238,0,,
16566.0,1.0,31.0,11.0,885.0,31.0,11.0,885.0,p1,unknown,sg,wayq,qal,,unknown,unknown,unknown,,אֹמַ֖ר,אמר,,ʾōmar,[I]+ say,"I replied,",11460.0,H559,1437586,661703,519119,430893,1174877,1911,20,,


## Importing the Dataframe Into a SQL Database

In [83]:
table_map = {
    'word': WORD_TABLE,
    'lex': LEX_TABLE,
    'phrase': PHRASE_TABLE,
    'clause': CLAUSE_TABLE,
    'clauseAtom': CLAUSE_ATOM_TABLE,
    'lexSentence': LEX_SENTENCE_TABLE,
    'strongs': STRONGS_TABLE,
    'passage': PASSAGE_TABLE,
    'book': BOOK_TABLE
}

In [84]:
# Define the data types for SQL.
from sqlalchemy.types import Integer, Text, Float
type_map = {
    'word':
        {'wordId': Integer(), 'book': Integer(), 'chKJV': Integer(), 'vsKJV': Integer(), 'vsIdKJV': Integer(), 'chBHS': Integer(), 'vsBHS': Integer(), 'vsIdBHS': Integer(), 
        'lang': Text(), 'speech': Text(), 'person': Text(), 'gender': Text(), 'number': Text(), 'vTense': Text(), 'vStem': Text(), 'state': Text(), 
        'prsPerson': Text(), 'prsGender': Text(), 'prsNumber': Text(), 'suffix': Text(), 'text': Text(), 'textCons':Text(), 'trailer': Text(), 'transliteration': Text(), 
        'glossExt': Text(), 'glossBSB': Text(), 'sortBSB': Float(), 'strongsId': Text(), 'lexId': Integer(), 'phraseId': Integer(), 'clauseAtomId': Integer(), 
        'clauseId': Integer(), 'sentenceId': Integer(), 'freqOcc': Integer(), 'rankOcc': Integer(), 'poetryMarker': Text(), 'parMarker': Text()},
    'lex':
        {'lexId': Integer(), 'language': Text(), 'speech': Text(), 'nameType': Text(), 'lexSet': Text(), 
        'lexText': Text(), 'gloss': Text(), 'freqLex': Integer(), 'rankLex': Integer()},
    'phrase':
        {'phraseId': Integer(), 'determined': Text(), 'function': Text(), 'phraseNumber': Integer(), 'phraseType': Text()},
    'clause':
        {'clauseId': Integer(), 'domain':Text(), 'kind': Text(), 'clauseNumber': Integer(), 'relation': Text(), 'clauseType': Text()},
    'clauseAtom':
        {'clauseAtomId': Integer(), 'code': Integer(), 'paragraph': Text(), 'tab': Integer(), 'clauseAtomType': Text()},
    'strongs':
        {'strongsId': Text(), 'lexeme': Text(), 'transliterationSTEP': Text(), 'morphCode': Text(), 'glossSTEP': Text(), 'definition': Text()},
    'lexSentence':
        {'lexId': Integer(), 'sentenceId': Integer(), 'sentenceWeight': Float()},
    'passage':
        {'passageId': Integer(), 'wordCount': Integer(), 'weight': Float(), 'startVsNode': Integer(), 'endVsNode': Integer()},
    'book':
        {'bookId': Integer(), 'chapters': Integer(), 'abbrOSIS': Text(), 'abbrLEB': Text(), 'bookName': Text(), 'bookNameHeb': Text(), 'tanakhSort': Text()}
}

In [85]:
from sqlalchemy import create_engine
sql_file = '../data_files/bhsa4c_custom.db'
# https://docs.sqlalchemy.org/en/14/core/engines.html
# sqlite://<nohostname>/<path> where <path> is relative:
con = create_engine(f"sqlite:///{sql_file}")
# Convert the dataframes to tables in the database 
for table in table_map:
    table_map[table].to_sql(
        table, 
        con=con, 
        if_exists='replace', 
        index=False,
        dtype=type_map[table]
    )
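After the import it is worth reading the tables back to confirm the row counts and declared column types. The following is a minimal, self-contained sketch of such a check on toy data. It is not the notebook's pipeline: it uses an in-memory database through a plain `sqlite3` connection (where `dtype` values are SQL type strings rather than SQLAlchemy types), and the two small tables are made up.

```python
import sqlite3
import pandas as pd

tables = {
    "lex": pd.DataFrame({"lexId": [1, 2], "gloss": ["in", "beginning"]}),
    "word": pd.DataFrame({"wordId": [1, 2, 3], "lexId": [1, 2, 1]}),
}
types = {
    "lex": {"lexId": "INTEGER", "gloss": "TEXT"},
    "word": {"wordId": "INTEGER", "lexId": "INTEGER"},
}

# Flag type-map keys that do not name a column of the matching frame.
unmatched = {
    name: sorted(set(types[name]) - set(frame.columns))
    for name, frame in tables.items()
}

con = sqlite3.connect(":memory:")
for name, frame in tables.items():
    frame.to_sql(name, con=con, if_exists="replace", index=False, dtype=types[name])

# Compare the stored row counts with the source frames.
row_counts = {
    name: int(pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {name}", con)["n"].iloc[0])
    for name in tables
}
# Inspect the declared column types of one table.
word_schema = {row[1]: row[2] for row in con.execute("PRAGMA table_info(word)")}

print(unmatched, row_counts, word_schema)
```

The `unmatched` check is there because a type-map key that matches no column has no effect on the resulting table, so a misspelled key can go unnoticed.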