# OpenHebrewBible (OHB) CSV Data to SQL Conversion V2

Eliran Wong used data from [ETCBC](https://github.com/ETCBC/bhsa) (Hebrew text BHSA, linguistic annotations, morphology, etc.), [OpenScriptures](https://github.com/openscriptures/morphhb) (Hebrew text WLC, Strong's numbers, morphology, etc.), and [Berean.bible](https://berean.bible) (interlinear translation, Berean Study Bible, etc.) to create a robust data repository called [OpenHebrewBible](https://github.com/eliranwong/OpenHebrewBible), consisting of CSV files that bridge the other three open-source projects.

I will take his compiled data file, [BHSA-with-extended-features.csv](https://github.com/eliranwong/OpenHebrewBible/blob/master/BHSA-with-extended-features.csv.zip), clean it, and convert it into a SQL database that I can use in my Flutter app. He also converted the BHSA TF 4c word data into a SQL DB file, [ETCBC4c.db](https://github.com/eliranwong/ETCBC-recycle/blob/master/sqlite3/ETCBC4c.db.zip). I will convert both the CSV and DB files to dataframes and combine useful data into a new dataframe. Later I will compare the converted combined dataframe to a BHSA SQL file, which can be downloaded [here](https://www.adambaker.org/bhsa.sqlite). Finally, I will convert the dataframe to a SQL database (after testing its data). 

**KEY**
- OHB_EXTENDED: [BHSA-with-extended-features.csv](https://github.com/eliranwong/OpenHebrewBible/blob/master/BHSA-with-extended-features.csv.zip)
- OHB_DB: [ETCBC4c.db](https://github.com/eliranwong/ETCBC-recycle/blob/master/sqlite3/ETCBC4c.db.zip)
- BHSA_DB: [bhsa.sqlite](https://www.adambaker.org/bhsa.sqlite)
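
The final step described above (dataframe to SQL database) could look roughly like the following. This is a minimal sketch rather than the notebook's actual export code: the table name `word`, the in-memory database, and the two-row `sample_df` are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataframe.
sample_df = pd.DataFrame({
    "bhs_id": [1, 2],
    "word": ["בְּ", "רֵאשִׁ֖ית"],
    "bsb_gloss": ["In", "the beginning"],
})

# An in-memory database keeps the sketch self-contained;
# a real export would pass a file path instead.
con = sqlite3.connect(":memory:")

# Write the dataframe to a table named 'word' (name assumed here).
sample_df.to_sql("word", con, index=False, if_exists="replace")

# Read the rows back to confirm the round trip.
rows = con.execute(
    "SELECT bhs_id, bsb_gloss FROM word ORDER BY bhs_id"
).fetchall()
print(rows)
con.close()
```

Passing `index=False` keeps the pandas row index out of the table, which matters here because `bhs_id` already serves as the word identifier.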

### Why use OHB data when BHSA already exists?
Some features that could be useful from OHB data that aren't present in BHSA are:
- Strong's number mapped to each node in the BHS.
- Data to align the BHS text with KJV and BSB translations.
- Poetic divisions (not tested).
- BSB gloss for a more accurate rendering of each word. 

## Imports

In [None]:
# Requirements: run in terminal. Change to 'pip' if on Windows OS. 
"""
pip3 install pandas
pip3 install numpy
pip3 install text-fabric
pip3 install jupyter
"""

In [1]:
import pandas as pd
import sqlite3
import numpy as np
import copy
from tf.app import use
from IPython.display import display, HTML

pd.set_option('display.max_columns', None)

## Constants

In [62]:
# Files
BHSA_DB = '../data_files/bhsa.sqlite'
OHB_DB = '../data_files/ETCBC4c.db'
OHB_EXTENDED = '../data_files/BHSA-with-extended-features.csv'

# Col Names
COL_NAMES = {
    'BHSwordSort': 'bhs_id',
    'paragraphMarker': 'par_marker',
    'poetryMarker': 'poem_marker',
    'KVJvsNode': 'kjv_vs_id',
    'KJVbook': 'kjv_bk',
    'KJVchapter': 'kjv_ch',
    'KJVverse': 'kjv_vs',
    'BHSvsNode': 'bhs_vs_id',
    'BHSbook': 'bhs_bk',
    'BHSchapter': 'bhs_ch',
    'BHSverse': 'bhs_vs',
    'clauseID': 'clause_id',
    'clauseKind': 'clause_kind',
    'clauseType': 'clause_type',
    'language': 'lang',
    'BHSwordPointed': 'word',
    'BHSwordConsonantal': 'word_cons',
    'Trailer': 'trailer',
    'SBLstyleTransliteration': 'transliteration',
    'HebrewLexeme': 'lex',
    'lexemeID': 'lex_id',
    'extendedStrongNumber': 'strongs',
    'morphologyCode': 'morph_code',
    'morphologyDetail': 'morph_detail',
    'ETCBCgloss': 'bhsa_gloss',
    'extendedGloss': 'extended_gloss',
    'BSBgloss': 'bsb_gloss',
    'BSBglossNode': 'bsb_pos'    
}

In [6]:
# DB Connections and Dataframes
BHSA_DB_CON = sqlite3.connect(BHSA_DB)
BHSA_DB_DF = pd.read_sql_query("SELECT * FROM word", BHSA_DB_CON)
print('BHSA_DB Data loaded')

OHB_DB_CON = sqlite3.connect(OHB_DB)
OHB_DB_DF = pd.read_sql_query("SELECT * FROM data", OHB_DB_CON)
print('OHB_DB Data loaded')

# Set low_memory to False so pandas reads the file in one pass and infers
# a single dtype per column instead of guessing chunk by chunk (mixed types).
OHB_EXTENDED_DF = pd.read_csv(OHB_EXTENDED, sep='\t', low_memory=False)
print('OHB_EXTENDED Data loaded')

BHSA_DB Data loaded
OHB_DB Data loaded
OHB_EXTENDED Data loaded


## The Data

The BHSA_DB file consists of all the TF word data from the BHSA dataset -- every word in the BHS Hebrew Old Testament as a node with dozens of features.

The OHB_DB file consists of some of that data along with KJV verse and chapter alignment. 

The OHB_EXTENDED file consists of some of the BHSA data along with added features. It has 22 feature columns and is tab-separated. All of the data consists of strings or positive integers. 

You can view the three dataframes below.

In [139]:
# BHSA_DB Word Data
display(HTML(BHSA_DB_DF.head(n=5).to_html(index=False)))

_id,freq_lex,freq_occ,g_cons,g_cons_utf8,g_lex,g_lex_utf8,g_nme,g_nme_utf8,g_pfm,g_pfm_utf8,g_prs,g_prs_utf8,g_uvf,g_uvf_utf8,g_vbe,g_vbe_utf8,g_vbs,g_vbs_utf8,g_word,g_word_utf8,gloss,gn,kq_hybrid,kq_hybrid_utf8,language,languageISO,lex,lex0,lex_utf8,lexeme_count,ls,nametype,nme,nu,number,pdp,pfm,prs,prs_gn,prs_nu,prs_ps,ps,qere,qere_trailer,qere_trailer_utf8,qere_utf8,rank_lex,rank_occ,sp,st,suffix_gender,suffix_number,suffix_person,trailer,trailer_utf8,uvf,vbe,vbs,voc_lex,voc_lex_utf8,vs,vt
1,15542,14194,B,ב,B.:-,בְּ,,,,,,,,,,,,,B.:-,בְּ,in,,,,Hebrew,hbo,B,B,ב,0,none,,,,1,prep,,absent,unknown,unknown,unknown,,,,,,3,3,prep,,unknown,unknown,unknown,,,absent,,,B.:,בְּ,,
2,51,45,R>CJT,ראשׁית,R;>CIJT,רֵאשִׁית,/,֜,,,,,,,,,,,R;>CI73JT,רֵאשִׁ֖ית,beginning,f,,,Hebrew,hbo,R>CJT/,R>CJT,ראשׁית,0,none,,,sg,2,subs,,absent,unknown,unknown,unknown,,,,,,706,868,subs,a,unknown,unknown,unknown,,,absent,,,R;>CIJT,רֵאשִׁית,,
3,48,15,BR>,ברא,B.@R@>,בָּרָא,,,,,,,,,[,,,,B.@R@74>,בָּרָ֣א,create,m,,,Hebrew,hbo,BR>[,BR>,ברא,0,none,,absent,sg,3,verb,absent,absent,unknown,unknown,unknown,p3,,,,,745,2341,verb,,unknown,unknown,unknown,,,absent,,absent,BR>,ברא,qal,perf
4,2601,1177,>LHJM,אלהים,>:ELOH,אֱלֹה,/IJM,ִ֜ים,,,,,,,,,,,>:ELOHI92JM,אֱלֹהִ֑ים,god(s),m,,,Hebrew,hbo,>LHJM/,>LHJM,אלהים,0,none,,JM,pl,4,subs,,absent,unknown,unknown,unknown,,,,,,18,31,subs,a,unknown,unknown,unknown,,,absent,,,>:ELOHIJM,אֱלֹהִים,,
5,10989,9743,>T,את,>;T,אֵת,,,,,,,,,,,,,>;71T,אֵ֥ת,<object marker>,,,,Hebrew,hbo,>T,>T,את,0,none,,,,5,prep,,absent,unknown,unknown,unknown,,,,,,4,4,prep,,unknown,unknown,unknown,,,absent,,,>;T,אֵת,,


In [82]:
# OHB_DB Data
display(HTML(OHB_DB_DF.head(n=5).to_html(index=False)))

word_ID,Book,ch_BHS,v_BHS,ch_KJV,v_KJV,manuscript,transliteration,lex_Hebrew,lex_number,gloss_Eng,lang,lang_def,morph_pdp,morph_pdp_def,morph_sp,morph_sp_def,morph_vs,morph_vs_def,morph_vt,morph_vt_def,morph_ps,morph_ps_def,morph_gn,morph_gn_def,morph_nu,morph_nu_def,morph_st,morph_st_def,prs_ps,prs_ps_def,prs_gn,prs_gn_def,prs_nu,prs_nu_def,clause_markers,clause_kind,clause_typ,clause_rela,phrase_markers,phrase_typ,phrase_rela,phrase_det,phrase_function
1,Gen,1,1,1,1,בְּ,bᵊ,בְּ,L70001,in,hbo,Ancient Hebrew,prep,preposition,prep,preposition,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,unknown,unknown,unknown,unknown,unknown,unknown,「,Verbal clauses,x-qatal-X clause,,『,Prepositional phrase,,undetermined,Time reference
2,Gen,1,1,1,1,רֵאשִׁ֖ית,rēšˌîṯ,רֵאשִׁית,L70002,beginning,hbo,Ancient Hebrew,subs,noun,subs,noun,,not applicable,,not applicable,,not applicable,f,feminine,sg,singular,a,absolute,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,』,Prepositional phrase,,undetermined,Time reference
3,Gen,1,1,1,1,בָּרָ֣א,bārˈā,ברא,L70003,create,hbo,Ancient Hebrew,verb,verb,verb,verb,qal,qal,perf,perfect,p3,third person,m,masculine,sg,singular,,not applicable,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,『』,Verbal phrase,,,Predicate
4,Gen,1,1,1,1,אֱלֹהִ֑ים,ʔᵉlōhˈîm,אֱלֹהִים,L70004,god(s),hbo,Ancient Hebrew,subs,noun,subs,noun,,not applicable,,not applicable,,not applicable,m,masculine,pl,plural,a,absolute,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,『』,Nominal phrase,,undetermined,Subject
5,Gen,1,1,1,1,אֵ֥ת,ʔˌēṯ,אֵת,L70005,[object marker],hbo,Ancient Hebrew,prep,preposition,prep,preposition,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,,not applicable,unknown,unknown,unknown,unknown,unknown,unknown,,Verbal clauses,x-qatal-X clause,,『,Prepositional phrase,,determined,Object


In [10]:
# OHB_EXTENDED Data
display(HTML(OHB_EXTENDED_DF.head(n=5).to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕,〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕,clauseID,clauseKind,clauseType,language,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,poneticTranscription,HebrewLexeme,lexemeID,StrongNumber,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,〔BSBsort＠BSB〕
1,¶,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>בְּ</heb><heb></heb>,<heb>ב</heb><heb></heb>,bĕ,bᵊ,<heb>בְּ</heb>,E70001,,H9003,prep,preposition,in,in,〔1＠In〕
2,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>רֵאשִׁ֖ית</heb><heb> </heb>,<heb>ראשית</heb><heb> </heb>,rēšît,rēšˌîṯ,<heb>רֵאשִׁית</heb>,E70002,H7225,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,〔2＠the beginning〕
3,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>בָּרָ֣א</heb><heb> </heb>,<heb>ברא</heb><heb> </heb>,bārā,bārˈā,<heb>ברא</heb>,E70003,H1254,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,〔4＠created〕
4,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>אֱלֹהִ֑ים</heb><heb> </heb>,<heb>אלהים</heb><heb> </heb>,ʾĕlōhîm,ʔᵉlōhˈîm,<heb>אֱלֹהִים</heb>,E70004,H430,H430,subs.m.pl.a,"noun, masculine, plural, absolute",god(s),god [pl.],〔3＠God〕
5,,,〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,c1,Verbal clauses,x-qatal-X clause,Hebrew,<heb>אֵ֥ת</heb><heb> </heb>,<heb>את</heb><heb> </heb>,ʾēt,ʔˌēṯ,<heb>אֵת</heb>,E70005,H853,H853,prep,preposition,[object marker],[object marker],


## Keep Features

Feature: Origin (origin column name)
- BHSA-aligned id:
- Consonantal text:
- Pointed text
- Lexeme
- Gloss (BHSA): OHB_EXTENDED
- Gloss (STEP): 
- Gloss (BSB or LEB): OHB_EXTENDED
- Part of Speech
- Person
- Number
- Gender
- Tense
- Stem
- State
- Pronoun suffix number
- Pronoun suffix gender
- Pronoun suffix person
- BHSA-phrase id
- BHSA-clause id
- BHSA-sentence id
- BHS b:ch:v: OHB_EXTENDED
- KJV b:ch:v: OHB_EXTENDED



## Cleaning the Data (OHB_EXTENDED)

### Data to Remove

- poneticTranscription: doesn't add any useful information.
- StrongNumber: doesn't add any useful information.

In [64]:
OHB_EXTENDED_DF.drop(columns=['poneticTranscription', 'StrongNumber'], inplace=True)

### Data to clean:
```
- 〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕: remove chars and place ints in new columns
- 〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕: remove chars and place ints in new columns
-  clauseID : remove 'c' prefix and convert to int
-  BHSwordPointed : remove html tags, place word in list - trailer in new list -> new column
-  BHSwordConsonantal : remove html tags, place word in list
-  HebrewLexeme : remove html tags
- 〔BSBsort＠BSB〕: remove chars and place int and string in new columns
```
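
The prefix-stripping items in the list above (clause IDs such as `c1`, and lexeme IDs such as `E70001`) can be sketched with one small helper. `strip_prefix_to_int` is a hypothetical name used only for this illustration, and the 70000 offset is an assumption based on the sample rows shown earlier.

```python
def strip_prefix_to_int(value, prefix, offset=0):
    """Remove a known prefix from an ID string and return it as an int.

    `offset` is subtracted afterwards, e.g. 70000 for lexeme IDs.
    """
    if not value.startswith(prefix):
        raise ValueError(f"{value!r} does not start with {prefix!r}")
    return int(value[len(prefix):]) - offset


print(strip_prefix_to_int("c12", "c"))            # clause ID
print(strip_prefix_to_int("E70001", "E", 70000))  # lexeme ID
```

Checking the prefix explicitly makes a malformed ID fail loudly instead of being silently converted.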

### 1. Clean Text and Clause Data

**Important note:** certain nodes do not have a text value for *BHSwordPointed* or *BHSwordConsonantal*, because some Hebrew morphemes are not written as separate consonants. For example, at BHS word nodes 61-62 we have:
```
61  <heb>לָ</heb><heb></heb>     <heb>ל</heb><heb></heb>     <heb>לְ</heb>    H9005   prep    to	
62  <heb></heb><heb></heb>      <heb></heb><heb></heb>      <heb>הַ</heb>    H9009   art     the	〔51＠the〕
```
This is from a clause in Genesis 1:5, with the Hebrew: וַיִּקְרָא אֱלֹהִים לָאוֹר יוֹם

Node 62 (the article) is embedded in the word לָאוֹר: it is absorbed into the preposition and shows up only in its vowel (a qamets in לָ), so the *he* of the article doesn't appear consonantally in the text.
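
To make the empty-node case concrete, here is a minimal sketch of splitting one cell into word and trailer with a regular expression. It is an alternative illustration, not the cleaning function used in this notebook; `HEB_PAIR` and `split_word_and_trailer` are hypothetical names, and the sketch only handles the `<heb>` form (not the `<arc>` variant).

```python
import re

# Two adjacent <heb>...</heb> spans: the word, then its trailer.
HEB_PAIR = re.compile(r"<heb>(.*?)</heb><heb>(.*?)</heb>")


def split_word_and_trailer(cell):
    """Return (word, trailer) from a '<heb>word</heb><heb>trailer</heb>' cell."""
    match = HEB_PAIR.fullmatch(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return match.group(1), match.group(2)


# Node 61 has a word but no trailer; node 62 has neither.
print(split_word_and_trailer("<heb>לָ</heb><heb></heb>"))
print(split_word_and_trailer("<heb></heb><heb></heb>"))
```

An embedded node such as 62 yields two empty strings, so downstream code should treat an empty word as valid data rather than as a parsing failure.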


##### Visualize pre-cleaned data

In [65]:
# Data before being cleaned.
display(HTML(
    OHB_EXTENDED_DF[
        ["BHSwordPointed", 
        "BHSwordConsonantal", 
        "HebrewLexeme", 
        "clauseID"]
    ].head().to_html(index=False))
)

BHSwordPointed,BHSwordConsonantal,HebrewLexeme,clauseID
<heb>בְּ</heb><heb></heb>,<heb>ב</heb><heb></heb>,<heb>בְּ</heb>,c1
<heb>רֵאשִׁ֖ית</heb><heb> </heb>,<heb>ראשית</heb><heb> </heb>,<heb>רֵאשִׁית</heb>,c1
<heb>בָּרָ֣א</heb><heb> </heb>,<heb>ברא</heb><heb> </heb>,<heb>ברא</heb>,c1
<heb>אֱלֹהִ֑ים</heb><heb> </heb>,<heb>אלהים</heb><heb> </heb>,<heb>אֱלֹהִים</heb>,c1
<heb>אֵ֥ת</heb><heb> </heb>,<heb>את</heb><heb> </heb>,<heb>אֵת</heb>,c1


##### Define functions

In [66]:
# Textual items in between word nodes, including paragraph markers, etc. 
text_extensions = {
    '', '׃', '׃ ׆ ס ', ' ס ', '׃ ׆ ', '׃ ', ' ׀ ',
    ' ', '׃ פ ', ' פ ', '׀ ', '׃ ׆ פ ', '־', '׃ ס '
}

# ---
# Function that takes a column name from the original df and
# returns cleaned text (word and extension) in two lists.
# Use: BHS pointed and consonantal text, all of which is of a format similar
# to : <heb>הָ</heb><heb></heb>. Be sure to update the df with the return value. 
def clean_text(col_name):
    cleaned_text = []
    trailers = []
    # All of the junk html text present.
    remove_items = "/<arc>hebqrQR"
    # Either of these will appear between the word and extension.
    separators = ["</heb><heb>", "</arc><arc>"]
    # Iterate over the original dataframe and clean the data. 
    for text_data in OHB_EXTENDED_DF[col_name]:
        # Place | at center so we can later split the text data. 
        for sep in separators:
            text_data = text_data.replace(sep, '|')
        # Remove all extra items.
        for char in remove_items:
            text_data = text_data.replace(char, "")

        # Note: I originally split each text and stored it in a list before 
        # appending to cleaned_text, but that caused an error when uploading 
        # to SQL because it needed an actual data type (e.g., string).
        
        # Split on '|': the part before it is the Hebrew word and
        # the part after it is the trailer.
        word, trailer = text_data.split('|')[:2]
        cleaned_text.append(word)
        trailers.append(trailer)
    
    return cleaned_text, trailers


# ---
# Clean the text in the HebrewLexeme column, all of which is
# in a format similar to: <heb>הָ</heb>. 
def clean_lexemes(col_name):
    cleaned_text = []
    # Read comments from clean_text()
    remove_items = "/<arc>hebqrQR"
    # Iterate over the original dataframe and clean the data. 
    for text_data in OHB_EXTENDED_DF[col_name]:
        # Remove all extra items.
        for char in remove_items:
            if char in text_data:
                text_data = text_data.replace(char, "")
        # Add the lexeme to the cleaned data.
        cleaned_text.append(text_data)

    return cleaned_text


# ---
# All clause data is of the format: c12. Remove the 'c's
# in the clause data and convert to int type. 
def clean_clauses(col_name):
    cleaned_ids = []
    # Iterate over the original dataframe and clean the data. 
    for clause in OHB_EXTENDED_DF[col_name]:
        cleaned_ids.append(int(clause.strip("c")))
        
    return cleaned_ids

# ---
# All lexemeID data is of the format: E70001. Remove the 'E's
# in the lexeme ID data, convert to int type, and subtract 70000. 
def clean_lex_ids(col_name):
    cleaned_ids = []
    # Iterate over the original dataframe and clean the data. 
    for lex_id in OHB_EXTENDED_DF[col_name]:
        cleaned_ids.append(int(lex_id.strip("E")) - 70000)
        
    return cleaned_ids

##### Call the functions -> update dataframe

In [67]:
# Update the data frame with the cleaned text and clauses. 
words, trailers = clean_text("BHSwordPointed")
OHB_EXTENDED_DF["BHSwordPointed"] = words
OHB_EXTENDED_DF.insert(
    OHB_EXTENDED_DF.columns.get_loc("BHSwordPointed"), 'Trailer', trailers)
OHB_EXTENDED_DF["BHSwordConsonantal"] = clean_text("BHSwordConsonantal")[0]
OHB_EXTENDED_DF["HebrewLexeme"] = clean_lexemes("HebrewLexeme")
OHB_EXTENDED_DF["clauseID"] = clean_clauses("clauseID")
OHB_EXTENDED_DF["lexemeID"] = clean_lex_ids("lexemeID")

##### Visualize cleaned data

In [69]:
# Print the head with the cleaned data.
display(HTML(
    OHB_EXTENDED_DF[
        ["BHSwordPointed", 
        "BHSwordConsonantal", 
        "Trailer",
        "HebrewLexeme", 
        "lexemeID",
        "clauseID"]
    ].head(5).to_html(index=False))
)

BHSwordPointed,BHSwordConsonantal,Trailer,HebrewLexeme,lexemeID,clauseID
בְּ,ב,,בְּ,1,1
רֵאשִׁ֖ית,ראשית,,רֵאשִׁית,2,1
בָּרָ֣א,ברא,,ברא,3,1
אֱלֹהִ֑ים,אלהים,,אֱלֹהִים,4,1
אֵ֥ת,את,,אֵת,5,1


### 2. Expand KJV, BHS, and BSB columns

In the original 22 columns there are three features that consist of concatenated values:

- 〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕:〔1｜1｜1｜1〕  
- 〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕:〔1｜1｜1｜1〕
- 〔BSBsort＠BSB〕:〔1＠In〕

I will convert each of them to new dataframes with separate columns for each value, and then merge them back into the original dataframe.
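
As a point of comparison, the same expansion can be sketched with pandas string methods instead of explicit loops. This is an illustrative alternative under stated assumptions, not the approach taken in the cells that follow: the `raw` dataframe and the short column names (`kjv_ref`, `bsb`, `vs_node`, and so on) are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical two-row sample in the same bracketed format.
raw = pd.DataFrame({
    "kjv_ref": ["〔1｜1｜1｜1〕", "〔2｜1｜1｜2〕"],
    "bsb": ["〔1＠In〕", None],  # None mimics a word with no BSB gloss
})

# Strip the brackets, split on the full-width bar, and cast to int.
kjv = (
    raw["kjv_ref"]
    .str.strip("〔〕")
    .str.split("｜", expand=True)
    .astype(int)
)
kjv.columns = ["vs_node", "book", "chapter", "verse"]

# Same idea for the gloss column; missing values stay missing.
bsb = raw["bsb"].str.strip("〔〕").str.split("＠", n=1, expand=True)
bsb.columns = ["gloss_node", "gloss"]
# Floats allow both missing values and decimal gloss positions.
bsb["gloss_node"] = pd.to_numeric(bsb["gloss_node"])

print(pd.concat([kjv, bsb], axis=1))
```

Using `n=1` in the gloss split guards against a gloss that itself contains the `＠` character.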

##### Visualize pre-cleaned data

In [17]:
# Data before being cleaned.
display(HTML(
    OHB_EXTENDED_DF[
        ['〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕', 
        '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕',
        '〔BSBsort＠BSB〕']
    ].head().to_html(index=False))
)

〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕,〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕,〔BSBsort＠BSB〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔1＠In〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔2＠the beginning〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔4＠created〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,〔3＠God〕
〔1｜1｜1｜1〕,〔1｜1｜1｜1〕,


##### Define functions

In [18]:
# Where the value is a list, convert the original 
# column to len(list) new columns. 
updated_col_names = {
    '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕': # cleaned
        ['KVJvsNode', 'KJVbook', 'KJVchapter', 'KJVverse'], 
    '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕': # cleaned
        ['BHSvsNode', 'BHSbook', 'BHSchapter', 'BHSverse'], 
    '〔BSBsort＠BSB〕': # cleaned
        ['BSBglossNode', 'BSBgloss']
}

# ---
# Function that takes a column name from 
# the original df and returns cleaned data.
# Use: convert the KJV ref or BHS ref column to new dataframe. 
def clean_references(col_name):
    # Create a dict for each new name to the new values.
    new_names = updated_col_names[col_name]
    cleaned_data = {name:[] for name in new_names}
    # Iterate over the original dataframe and clean the data. 
    for ref_data in OHB_EXTENDED_DF[col_name]:
        # Remove the outsides of 〔1｜1｜1｜1〕.
        ref_data = ref_data.strip('〕〔')
        # Split 1｜1｜1｜1 and convert each item to an int.
        ref_data = [int(data) for data in ref_data.split('｜')]
        # Add data to the dictionary. 
        cleaned_data[new_names[0]].append(ref_data[0]) # vs node
        cleaned_data[new_names[1]].append(ref_data[1]) # book
        cleaned_data[new_names[2]].append(ref_data[2]) # chapter
        cleaned_data[new_names[3]].append(ref_data[3]) # verse
   
    # Convert the dictionary to a dataframe and return.
    new_df = pd.DataFrame(cleaned_data)
    return new_df

# ---
# Clean the BSB gloss data and store in a new dataframe. 
def clean_gloss(col_name):
    # Create a dict for each new name to the new values.
    new_names = updated_col_names[col_name]
    cleaned_data = {name:[] for name in new_names}
    # Iterate over the original dataframe and clean the data. 
    for gloss_data in OHB_EXTENDED_DF[col_name]:
        # Catch edge cases where gloss_data is NaN.
        if isinstance(gloss_data, str):
            # Remove the outsides of 〔1＠In〕.
            gloss_data = gloss_data.strip('〕〔')
            # Split 1＠In and convert first item to an int.
            gloss_data = gloss_data.split('＠')
            
            # For some reason, gloss node 237839 is split into 
            # decimals .1 and .2, which is why I am using a float. 

            # Add data to the dictionary. 
            gloss_data[0] = float(gloss_data[0])
            cleaned_data[new_names[0]].append(gloss_data[0]) # gloss node
            cleaned_data[new_names[1]].append(gloss_data[1]) # gloss
        else:
            cleaned_data[new_names[0]].append(gloss_data) # gloss node
            cleaned_data[new_names[1]].append(gloss_data) # gloss

    # Convert the dictionary to a dataframe and return.
    new_df = pd.DataFrame(cleaned_data)
    return new_df

##### Call the functions -> new dataframes

In [70]:
# Clean the reference data and store in two new dataframes. 
KJV_ref_df = clean_references(
    '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕')
BHS_ref_df = clean_references(
    '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕')
BSB_gloss_df = clean_gloss(
    '〔BSBsort＠BSB〕')

##### Visualize cleaned data

In [71]:
# Print the head of the cleaned data.
display(HTML(
    pd.concat(
        [KJV_ref_df, 
        BHS_ref_df, 
        BSB_gloss_df],
        axis=1
    ).head().to_html(index=False))
)

KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,BSBglossNode,BSBgloss
1,1,1,1,1,1,1,1,1.0,In
1,1,1,1,1,1,1,1,2.0,the beginning
1,1,1,1,1,1,1,1,4.0,created
1,1,1,1,1,1,1,1,3.0,God
1,1,1,1,1,1,1,1,,


### 3. Rename columns and combine dataframes

##### Define functions

In [72]:
# ---
# Drop the three columns that I expanded into new dataframes.
def drop_old_data(dataframe):
    dataframe = dataframe.drop(
        columns=[
        '〔KJVverseSort｜KJVbook｜KJVchapter｜KJVverse〕', 
        '〔BHSverseSort｜BHSbook｜BHSchapter｜BHSverse〕',
        '〔BSBsort＠BSB〕']
    )
    return dataframe

# ---
# Rename columns and add the three 
# new dataframes for a final df output. 
def combine_data():
    df_copy = copy.deepcopy(OHB_EXTENDED_DF)
    updated_df = pd.DataFrame()
    # Make sure the replaced data gets dropped. 
    if "〔BSBsort＠BSB〕" in df_copy.columns:
        df_copy = drop_old_data(df_copy)
    # Rename the other columns and build updated_df.
    for column in df_copy:
        # If at the column before 〔KJVverseSort..., 
        # add the new reference dataframes.
        if column == "poetryMarker":
            updated_df = pd.concat(
                [updated_df, 
                df_copy[column], 
                KJV_ref_df, 
                BHS_ref_df], 
                axis=1)
        # If at the column before 〔BSBsort..., 
        # add the new BSB dataframe.
        elif column == "extendedGloss":
            updated_df = pd.concat(
                [updated_df, 
                df_copy[column], 
                BSB_gloss_df], 
                axis=1)
        # Otherwise add the current column 
        # from the original dataframe.
        else:
            updated_df[column] = df_copy[column]

    return updated_df

##### Call function -> combined dataframe

In [73]:
# Store the combined data in a new dataframe. 
ohb_extended_cleaned = combine_data()

### 4. Finally - Replace NaN values with None in the string columns.

##### Visualize pre-cleaned data

In [74]:
# Display the labeled data.
display(HTML(ohb_extended_cleaned.head().to_html(index=False)))

BHSwordSort,paragraphMarker,poetryMarker,KVJvsNode,KJVbook,KJVchapter,KJVverse,BHSvsNode,BHSbook,BHSchapter,BHSverse,clauseID,clauseKind,clauseType,language,Trailer,BHSwordPointed,BHSwordConsonantal,SBLstyleTransliteration,HebrewLexeme,lexemeID,extendedStrongNumber,morphologyCode,morphologyDetail,ETCBCgloss,extendedGloss,BSBglossNode,BSBgloss
1,¶,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בְּ,ב,bĕ,בְּ,1,H9003,prep,preposition,in,in,1.0,In
2,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,רֵאשִׁ֖ית,ראשית,rēšît,רֵאשִׁית,2,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,2.0,the beginning
3,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בָּרָ֣א,ברא,bārā,ברא,3,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,4.0,created
4,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,אֱלֹהִ֑ים,אלהים,ʾĕlōhîm,אֱלֹהִים,4,H430,subs.m.pl.a,"noun, masculine, plural, absolute",god(s),god [pl.],3.0,God
5,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,אֵ֥ת,את,ʾēt,אֵת,5,H853,prep,preposition,[object marker],[object marker],,


##### Define function

In [75]:
# All string type columns in ohb_extended_cleaned that have NaNs present.
cols_with_NaN = [
    "paragraphMarker",
    "poetryMarker",
    "SBLstyleTransliteration",
    "BSBgloss"
]

# ---
# Function to replace NaNs with None. 
def replace_NaNs():
    for col in cols_with_NaN:
        ohb_extended_cleaned[col] = ohb_extended_cleaned[col].replace({np.nan: None})

##### Call function -> clean dataframe

Note: I retained NaNs in the *BSBglossNode* column since it stores floats.

In [76]:
# Call the function.
replace_NaNs()

##### Update column names

In [77]:
ohb_extended_cleaned.rename(columns=COL_NAMES, inplace=True)

##### Visualize the final cleaned dataframe

In [78]:
# Display our newly cleaned and labeled data.
display(HTML(ohb_extended_cleaned.head().to_html(index=False)))

bhs_id,par_marker,poem_marker,kjv_vs_id,kjv_bk,kjv_ch,kjv_vs,bhs_vs_id,bhs_bk,bhs_ch,bhs_vs,clause_id,clause_kind,clause_type,lang,trailer,word,word_cons,transliteration,lex,lex_id,strongs,morph_code,morph_detail,bhsa_gloss,extended_gloss,bsb_pos,bsb_gloss
1,¶,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בְּ,ב,bĕ,בְּ,1,H9003,prep,preposition,in,in,1.0,In
2,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,רֵאשִׁ֖ית,ראשית,rēšît,רֵאשִׁית,2,H7225,subs.f.sg.a,"noun, feminine, singular, absolute",beginning,beginning,2.0,the beginning
3,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,בָּרָ֣א,ברא,bārā,ברא,3,H1254,verb.qal.perf.p3.m.sg,"verb, qal, perfect, third person, masculine, singular",create,[he]+ create,4.0,created
4,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,אֱלֹהִ֑ים,אלהים,ʾĕlōhîm,אֱלֹהִים,4,H430,subs.m.pl.a,"noun, masculine, plural, absolute",god(s),god [pl.],3.0,God
5,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,Hebrew,,אֵ֥ת,את,ʾēt,אֵת,5,H853,prep,preposition,[object marker],[object marker],,


## Test OHB_EXTENDED Data Against OHB_DB

I want to compare values in the OHB_EXTENDED data to the OHB_DB data, especially text values, to make sure that the data is usable and accurate before merging features.
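
The comparison described above boils down to a row-by-row inequality check between two aligned columns. A minimal vectorized sketch follows, assuming both dataframes have the same length, row order, and index; the three-row `db` and `ext` frames are made-up sample values, not real data.

```python
import pandas as pd

# Hypothetical stand-ins for one column from each source.
db = pd.DataFrame({"v_KJV": [1, 1, 34]})
ext = pd.DataFrame({"kjv_vs": [1, 2, 33]})

# True wherever the two sources disagree (rows are matched by index).
mask = db["v_KJV"].ne(ext["kjv_vs"])

# Collect both values side by side for the disagreeing rows.
mismatches = pd.DataFrame({
    "ohb_db": db.loc[mask, "v_KJV"],
    "ohb_ext": ext.loc[mask, "kjv_vs"],
})
print(mismatches)
```

One caveat: `NaN` never equals `NaN`, so columns with missing values would report those rows as mismatches unless they are filled or filtered first.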

#### Feature mappings

In [85]:
col_map = {
    'ch_KJV': 'kjv_ch',
    'v_KJV': 'kjv_vs',
    'ch_BHS': 'bhs_ch',
    'v_BHS': 'bhs_vs',
    'clause_kind': 'clause_kind',
    'clause_typ': 'clause_type',
    'manuscript': 'word',
    'lex_Hebrew': 'lex',
}

#### Tests

In [92]:
def check_refs():
    mismatches = {}
    for col in ['ch_KJV', 'v_KJV', 'ch_BHS', 'v_BHS']:
        ohb_ref = [r for r in OHB_DB_DF[col]]
        ext_ref = [r for r in ohb_extended_cleaned[col_map[col]]]
        for i, ref in enumerate(ohb_ref):
            if ref != ext_ref[i]:
                mismatches[str(i)+col] = (ref, ext_ref[i])
    return mismatches

def check_clauses():
    mismatches = {}
    for col in ['clause_kind', 'clause_typ']:
        ohb_clause = [r for r in OHB_DB_DF[col]]
        ext_clause = [r for r in ohb_extended_cleaned[col_map[col]]]
        for i, clause in enumerate(ohb_clause):
            if clause != ext_clause[i]:
                mismatches[str(i)+col] = (clause, ext_clause[i])
    return mismatches

def check_text():
    # OHB_DB manuscript value is the text + the trailer.
    text_extensions = [
    '׃', '׃ ׆ ס ', ' ס ', '׃ ׆ ', '׃ ', ' ׀ ',
    ' ', '׃ פ ', ' פ ', '׀ ', '׃ ׆ פ ', '־', '׃ ס ', '׀', 'פ', 'ס', '<', 'Q', 'R', '>', 'q', 'r', '׆'
    ]
    mismatches = {}
    for col in ['manuscript']:
        ohb_text = [w for w in OHB_DB_DF[col]]
        ext_text = [w for w in ohb_extended_cleaned[col_map[col]]]
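        for i, w in enumerate(ohb_text):
            w2 = ext_text[i]
            # Characters in the OHB_EXTENDED word that are missing from the OHB_DB word.
            dif = [ch for ch in w2 if ch not in w]
            # Ignore characters that are trailers or leftover markup.
            dif_checked = [ch for ch in dif if ch not in text_extensions]
            if len(dif_checked) > 0:
                mismatches[str(i)+col] = (w, w2)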
    return mismatches

def check_lex():
    mismatches = {}
    for col in ['lex_Hebrew']:
        ohb_text = [w for w in OHB_DB_DF[col]]
        ext_text = [w for w in ohb_extended_cleaned[col_map[col]]]
        for i, w in enumerate(ohb_text):
            if w != ext_text[i]:
                mismatches[str(i)+col] = (w, ext_text[i])
    return mismatches

#### Test Results

In [None]:
ref_mismatches = check_refs()
message1 = f"Refs Unaligned\n{ref_mismatches}" if len(ref_mismatches) > 0 else "Refs Aligned"
print(message1)

clause_mismatches = check_clauses()
message2 = f"Clauses Unaligned\n{clause_mismatches}" if len(clause_mismatches) > 0 else "Clauses Aligned"
print(message2)

text_mismatches = check_text()
message3 = f"Text Unaligned\n{text_mismatches}" if len(text_mismatches) > 0 else "Text Aligned"
print(message3)

lex_mismatches = check_lex()
message4 = f"Lex Unaligned\n{lex_mismatches}" if len(lex_mismatches) > 0 else "Lex Aligned"
print(message4)

**We have:**

Refs Unaligned

{'152831v_KJV': (1, 2), '152832v_KJV': (1, 2), '152833v_KJV': (1, 2), '152834v_KJV': (1, 2), '152835v_KJV': (1, 2), '152836v_KJV': (1, 2), '152837v_KJV': (1, 2), '152838v_KJV': (1, 2), '191090v_KJV': (34, 33), '191091v_KJV': (34, 33), '191092v_KJV': (34, 33), '191093v_KJV': (34, 33), '191094v_KJV': (34, 33), '191095v_KJV': (34, 33), '191096v_KJV': (34, 33), '191097v_KJV': (34, 33), '191098v_KJV': (34, 33), '191099v_KJV': (34, 33), '191100v_KJV': (34, 33), '191101v_KJV': (34, 33), '191102v_KJV': (34, 33), '191103v_KJV': (34, 33), '191104v_KJV': (34, 33), '191962v_KJV': (3, 2), '191963v_KJV': (3, 2), '191964v_KJV': (3, 2), '191965v_KJV': (3, 2), '191966v_KJV': (3, 2), '191967v_KJV': (3, 2), '194111v_KJV': (21, 22), '194112v_KJV': (21, 22), '194113v_KJV': (21, 22), '194114v_KJV': (21, 22), '194115v_KJV': (21, 22), '194116v_KJV': (21, 22), '194578v_KJV': (44, 43), '194579v_KJV': (44, 43), '194580v_KJV': (44, 43), '194581v_KJV': (44, 43), '194582v_KJV': (44, 43), '194583v_KJV': (44, 43), '194584v_KJV': (44, 43), '194585v_KJV': (44, 43), '194586v_KJV': (44, 43), '194587v_KJV': (44, 43), '194588v_KJV': (44, 43), '194589v_KJV': (44, 43), '194590v_KJV': (44, 43), '194591v_KJV': (44, 43)}

Clauses Aligned

Text Aligned

Lex Aligned

**Next Steps:**
Check references against STEP Bible.

In [None]:
# To visualize which books and chapters the divergent refs occur in. 
import re

# Defined up front so the print below works even with no mismatches.
formatted_refs = {}
df = ohb_extended_cleaned
for k in ref_mismatches:
    # Keys look like '152831v_KJV': the leading digits are the row position.
    node = int(re.match(r"\d+", k).group())
    row = df.iloc[node]
    formatted_refs[k] = f"{row['kjv_bk']}:{row['kjv_ch']}:{row['kjv_vs']} {row['word']}"

print(formatted_refs)

**Comparison to STEP Bible**

Nodes 152831-152838:
- ohb_db: 1
- ohb_ext: 2
- STEP: 2 (these words are in vs 1 of BHS)

Nodes 191090-191104:
- ohb_db: 34
- ohb_ext: 33
- STEP: 33 (these words are in vs 34 of BHS)

Nodes 191962-191967:
- ohb_db: 3
- ohb_ext: 2
- STEP: 2 (these words are in vs 3 of BHS)

Nodes 194111-194116:
- ohb_db: 21
- ohb_ext: 22
- STEP: 22 (these words are in vs 21 of BHS)

Nodes 194578-194591:
- ohb_db: 44
- ohb_ext: 43
- STEP: 43 (these words are in vs 43 of BHS)

**Observations:** The OHB_EXTENDED data accurately reflects the KJV. In most divergent cases, the OHB_DB data is reflecting the BHS verse value for a node rather than the KJV verse value. 


## Save OHB_EXTENDED as CSV file

In [103]:
file = '../data_files/ohb_extended_cleaned.csv'
ohb_extended_cleaned.to_csv(file, index=False)

## Combine OHB_EXTENDED and OHB_DB and BHSA_DB features into new DF

In [106]:
OHB_COMBINED = copy.deepcopy(ohb_extended_cleaned)

# Data from OHB_DB
OHB_COMBINED['lang'] = OHB_DB_DF['lang']
OHB_COMBINED['phrase_typ'] = OHB_DB_DF['phrase_typ']
OHB_COMBINED['phrase_det'] = OHB_DB_DF['phrase_det']
OHB_COMBINED['phrase_function'] = OHB_DB_DF['phrase_function']

# Data to drop that we'll get from BHSA_DB
OHB_COMBINED.drop(columns=['morph_code', 'morph_detail'], inplace=True)

# Replace Nans with None
for col in ['phrase_typ', 'phrase_det', 'phrase_function']:
    OHB_COMBINED[col] = OHB_COMBINED[col].replace({np.nan: None})

In [107]:
# Display our newly combined data.
display(HTML(OHB_COMBINED.head().to_html(index=False)))

bhs_id,par_marker,poem_marker,kjv_vs_id,kjv_bk,kjv_ch,kjv_vs,bhs_vs_id,bhs_bk,bhs_ch,bhs_vs,clause_id,clause_kind,clause_type,lang,trailer,word,word_cons,transliteration,lex,lex_id,strongs,bhsa_gloss,extended_gloss,bsb_pos,bsb_gloss,phrase_typ,phrase_det,phrase_function
1,¶,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,hbo,,בְּ,ב,bĕ,בְּ,1,H9003,in,in,1.0,In,Prepositional phrase,undetermined,Time reference
2,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,hbo,,רֵאשִׁ֖ית,ראשית,rēšît,רֵאשִׁית,2,H7225,beginning,beginning,2.0,the beginning,Prepositional phrase,undetermined,Time reference
3,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,hbo,,בָּרָ֣א,ברא,bārā,ברא,3,H1254,create,[he]+ create,4.0,created,Verbal phrase,,Predicate
4,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,hbo,,אֱלֹהִ֑ים,אלהים,ʾĕlōhîm,אֱלֹהִים,4,H430,god(s),god [pl.],3.0,God,Nominal phrase,undetermined,Subject
5,,,1,1,1,1,1,1,1,1,1,Verbal clauses,x-qatal-X clause,hbo,,אֵ֥ת,את,ʾēt,אֵת,5,H853,[object marker],[object marker],,,Prepositional phrase,determined,Object


In [108]:
# Save as new file.
file1 = '../data_files/ohb_combined.csv'
OHB_COMBINED.to_csv(file1, index=False)

## Align OHB_EXTENDED with BHSA_DB and Test

See notes [here](https://docs.google.com/document/d/1WE59plLi8EQTaDijHkdgPCVAwOc_TlQU4GvDjyEsPAA/edit?usp=sharing).

Set node 16563 to 16564 and increment every node after it by 1.

I expanded node 392485 into three nodes manually and saved the data in ohb_combined_aligned.csv, so every node after 392485 must also be incremented.

Set the marker value 3924860 to 392489 and increment every node after it by 3 in total (1 from the first shift plus 2 for the added nodes).
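The renumbering rule can be illustrated on a toy list before applying it to the real data. This is only a sketch: `shift_nodes` and the toy values are invented for illustration, and the toy input assumes the manually added rows simply continue the numbering of the expanded node, which is a guess about the layout of the edited file.

```python
def shift_nodes(nodes, rules):
    """Renumber nodes with a running offset.

    `rules` maps a trigger value to (replacement value, amount added to the offset).
    """
    shifted = []
    offset = 0
    for n in nodes:
        if n in rules:
            replacement, bump = rules[n]
            shifted.append(replacement)
            offset += bump
        else:
            shifted.append(n + offset)
    return shifted

# Toy analogue: node 3 is skipped in the reference numbering (3 -> 4, offset +1),
# node 5 was expanded into rows 5, 6, 7, and the original node 6 is marked as 60.
toy_nodes = [1, 2, 3, 4, 5, 6, 7, 60, 7]
toy_rules = {3: (4, 1), 60: (9, 2)}

result = shift_nodes(toy_nodes, toy_rules)
reference = [n for n in range(1, 11) if n != 3]
print(result)
print(result == reference)
```

The marker value matters because, after the manual expansion, two rows would otherwise carry the same number and the second shift could not be triggered at the right row.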

In [170]:
# Load the aligned file that I manually edited
file2 = '../data_files/ohb_combined_aligned.csv'
OHB_ALIGNED = pd.read_csv(file2, sep=',', low_memory=False)

In [181]:
# Increment nodes.
def update_nodes():
    mismatches = []
    bhs_nodes = [n for n in BHSA_DB_DF['_id'] if n != 16563]
    nodes = [n for n in OHB_ALIGNED['bhs_id']]
    k = 0
    # Update node values
    for i, n in enumerate(nodes):
        if n == 16563:
            nodes[i] = 16564
            k += 1
        elif n == 3924860:
            nodes[i] = 392489
            k += 2
        else:
            nodes[i] += k
    # Compare updated values against BHSA
    for i, n in enumerate(nodes):
        if n != bhs_nodes[i]:
            mismatches.append((i, n, bhs_nodes[i]))
            # break
    if len(mismatches) > 0:
        # raise Exception("The nodes don\'t match")
        print("Mismatches:\n", mismatches)
        return None
    else: 
        return nodes

In [182]:
# Update OHB_ALIGNED with aligned nodes.
nodes = update_nodes()
if nodes:
    OHB_ALIGNED['bhs_id'] = nodes
    # Save to file
    OHB_ALIGNED.to_csv(file2, index=False)

### Tests

In [203]:
col_map = {
    'bhs_id': '_id',
    'word_cons': 'g_cons_utf8',
    'word': 'g_word_utf8',
    'bhsa_gloss': 'gloss',
    'lang': 'languageISO',
    'trailer': 'trailer_utf8',
    'lex': 'voc_lex_utf8'
}

In [211]:
for c in col_map:
    OHB_ALIGNED[c] = OHB_ALIGNED[c].replace({np.nan: ""})

def test_aligned():
    mismatches = {k:[] for k in col_map}
    qere_words = [i for i in BHSA_DB_DF['qere_utf8']]  
    qere_trailers = [i for i in BHSA_DB_DF['qere_trailer_utf8']]
    for col in col_map:
        ohb_data = [i for i in OHB_ALIGNED[col]]
        bhs_data = [i for i in BHSA_DB_DF[col_map[col]]]
        k = 0
        for i, d in enumerate(ohb_data):
            if i == 16562:
                k+=1
            i += k
            # See https://etcbc.github.io/bhsa/features/qere_utf8/
            if col == 'word':
                w = bhs_data[i] if not qere_words[i] else qere_words[i]
                if d != w:
                    mismatches[col].append((i+1, d, w))
            elif col == 'trailer':
                t = bhs_data[i] if not qere_trailers[i] else qere_trailers[i]
                if d != t:
                    mismatches[col].append((i+1, d, t))
            # bhsa uses <> rather than [] for values like object marker.
            elif col == 'bhsa_gloss':
                d2 = bhs_data[i]
                # Characters in the BHSA gloss that are absent from the OHB gloss.
                dif = [ch for ch in d2 if ch not in d]
                dif_checked = [ch for ch in dif if ch not in ['<','>','[',']']]
                if len(dif_checked) > 0:
                    mismatches[col].append((i+1, d, bhs_data[i]))
            elif d != bhs_data[i]:
                mismatches[col].append((i+1, d, bhs_data[i]))
    return mismatches

In [205]:
# Collect mismatch data and save csv files.
mismatches = test_aligned()
path = '../data_files/mismatches/'
def export_mismatches():
    for k in mismatches:
        # the cons vals in BHSA don't have shin/sin differentiation.
        if k != 'word_cons' and len(mismatches[k]) > 0:
            data = {'node':[], 'ohb':[], 'bhsa':[]}
            for v in mismatches[k]:
                n, o, b = v
                data['node'].append(n)
                data['ohb'].append(o)
                data['bhsa'].append(b)
            df = pd.DataFrame(data)
            df.to_csv(f"{path}{k}.csv", index=False)

In [206]:
# Save csv files. 
export_mismatches()

### Mismatch notes

**word.csv:** 1869 mismatches that predominantly consist of g_word_utf8 lacking pointings. It seems best to use the OHB data in this case.

**lex.csv:** 3 mismatches
```
node	ohb	bhsa
152522	חַי	חַיִּים
392488	מְּנֻחֹות	מְנוּחָה
394199	ושׁני֜	וַשְׁנִי
```
**bhsa_gloss.csv:** 490 mismatches -- mostly repeats (e.g., where ohb has cloth and bhsa has clothe). BHSA seems to be more accurate here. 

**trailer.csv:** 150 mismatches

**Note:** The BHSA has the [features](https://etcbc.github.io/bhsa/features/qere_utf8/) qere_utf8 and qere_trailer_utf8, which provide vocalized data when it is lacking in the *ketiv* form. I've updated the mismatches code to choose those vocalized options in the BHSA data when the ketiv form is missing pointings.
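The fallback described in this note amounts to "use the qere value when one exists, otherwise keep the ketiv value". A minimal sketch of that rule follows; `preferred_form` is a hypothetical helper, and the strings are placeholders rather than real BHSA rows.

```python
def preferred_form(ketiv, qere):
    """Return the qere form when one is present, otherwise the ketiv form."""
    return qere if qere else ketiv

# Placeholder (g_word_utf8, qere_utf8) pairs; an empty string or None means "no qere".
rows = [
    ("word1", ""),
    ("unpointed2", "pointed2"),
    ("word3", None),
]
chosen = [preferred_form(ketiv, qere) for ketiv, qere in rows]
print(chosen)
```

The same rule applies to the trailer columns, with trailer_utf8 and qere_trailer_utf8 in place of the word columns.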

**UPDATED MISMATCHES**

**word.csv:** 2 differences
```
node	ohb	bhsa
199283		ה
205832	שֻׁ֝֩בו 	שֻׁ֝֠בוּ
```
**trailer.csv:** 6 differences, bhsa following qere. 
```
node	ohb	bhsa
137795	''	 ' ' 
156164	''	־
227810	''	 ' '
345548	''	 ' '
363613	''	 ' '
364988	''	 ' '
```

**CONCLUSIONS**

Most of these are insignificant. It is likely best to go with the BHSA gloss column rather than the OHB gloss column. 
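One caveat on the gloss numbers: the check in `test_aligned` only flags a row when the BHSA gloss contains a character that the OHB gloss lacks, so a pair like "clothe" (OHB) and "cloth" (BHSA) would pass unnoticed. A stricter comparison, sketched below under the assumption that bracket style is the only difference worth ignoring, normalizes the brackets and then compares whole strings. Both function names are hypothetical.

```python
def normalize_gloss(gloss):
    """Map BHSA-style angle brackets to square brackets and trim whitespace."""
    return gloss.replace('<', '[').replace('>', ']').strip()

def glosses_match(ohb_gloss, bhsa_gloss):
    """Compare two glosses as whole strings after bracket normalization."""
    return normalize_gloss(ohb_gloss) == normalize_gloss(bhsa_gloss)

print(glosses_match('[object marker]', '<object marker>'))  # True
print(glosses_match('cloth', 'clothe'))                     # False
print(glosses_match('clothe', 'cloth'))                     # False
```

A stricter check like this may report more mismatches than the 490 recorded above, so the counts would not be directly comparable.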

In [None]:
# View specified rows in OHB and BHSA
i = 199283-5
display(HTML(OHB_COMBINED.loc[i:i+10].to_html(index=False)))

In [None]:
display(HTML(BHSA_DB_DF.loc[i:i+10].to_html(index=False)))

## Importing the Dataframe Into a SQL Database

In [None]:
sql_file = '../data_files/ohb_extended.db'
con = sqlite3.connect(sql_file) 
# Convert the dataframe to a table called extended_data in the ohb_extended.db database.
ohb_extended_cleaned.to_sql(
    "extended_data", 
    con=con, 
    if_exists='replace', 
    index=False)
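After writing the table, a quick read-back is a reasonable sanity check. The sketch below uses an in-memory SQLite database as a stand-in for ohb_extended.db, with a reduced set of columns and the first three rows from the sample table shown earlier; the real table has many more columns and rows.

```python
import sqlite3

# In-memory stand-in for ohb_extended.db (assumption: only a few columns are recreated here).
con = sqlite3.connect(':memory:')
con.execute(
    "CREATE TABLE extended_data ("
    "bhs_id INTEGER, kjv_bk INTEGER, kjv_ch INTEGER, kjv_vs INTEGER, strongs TEXT)"
)
con.executemany(
    "INSERT INTO extended_data VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 'H9003'),
        (2, 1, 1, 1, 'H7225'),
        (3, 1, 1, 1, 'H1254'),
    ],
)
con.commit()

# Check the row count, then fetch the words of one verse by KJV reference.
row_count = con.execute("SELECT COUNT(*) FROM extended_data").fetchone()[0]
verse_rows = con.execute(
    "SELECT bhs_id, strongs FROM extended_data "
    "WHERE kjv_bk = ? AND kjv_ch = ? AND kjv_vs = ? ORDER BY bhs_id",
    (1, 1, 1),
).fetchall()
con.close()

print(row_count)
print(verse_rows)
```

Against the real file, the row count can be compared with `len(ohb_extended_cleaned)` to confirm that nothing was dropped during the export.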