## Identify the Key Phrases Most Often Shared between all Four Gospels
The goal of this exercise is to identify the key phrases shared by all four Gospels (Matthew, Mark, Luke, John), and then see how frequently they appear in each book.

### Reading In Already-Identified Key Phrases from *bible_key_phrase_extractor.ipynb* Outputs
The Jupyter notebook *bible_key_phrase_extractor.ipynb* contains the scripts that perform the key phrase extraction using the MS Azure AI Language API. These scripts take as input the filename of a Bible file (from the Catholic Public Domain Version, https://bitbucket.org/sbruno/cpdv-json-encoder/src/master/) and write JSON files into datasets/key_phrases_results that summarize the key phrases in each book, along with their frequency.

In short, the key phrase identification work should already be done by the time you reach this Jupyter notebook; all you need to do is analyze the outputs. 

## Key Phrase Analyses
The analysis begins here. First, import the libraries.

In [12]:
# Import libraries

import json
import pandas as pd
import plotly.express as px



### Function Definitions
Define functions to count the frequency of key phrases across an entire book. These functions assume that the key phrase extraction is already complete, with outputs saved into datasets/key_phrases_results.



In [10]:
# Define a function to count how often each key phrase appears in a given text.
# It reads one of the JSON files in datasets/key_phrases_results and returns a dictionary whose keys are the key phrases and whose values are the number of times each key phrase appears in the text.
# If drop is > -1, only the most common phrases are kept.  For example, if drop = 10, only the 10 most common phrases are returned.
def count_key_phrases(filename, drop = -1):
    
    # Read the JSON file using the JSON library at datasets/key_phrases_results/filename and store it in kp_dict
    
    with open('datasets/key_phrases_results/' + filename) as f:
        kp_dict = json.load(f)
    
    # First, combine all the values from kp_dict into a single list.
    # This is a list of lists.
    # Then flatten it into one list of key phrases.
    
    list_of_lists = list(kp_dict.values())
    list_of_lists = [item for sublist in list_of_lists for item in sublist]

    # Now count the total number of times each phrase appears in list_of_lists
    
    key_phrase_counts = {i:list_of_lists.count(i) for i in list_of_lists}

    # Then sort numerically from highest to lowest by the values.
    
    key_phrase_counts = {k: v for k, v in sorted(key_phrase_counts.items(), key=lambda item: item[1], reverse = True)}

    # Select only the first drop elements of the dictionary
    
    if drop >-1:
        key_phrase_counts = dict(list(key_phrase_counts.items())[0:drop])

    return key_phrase_counts


# This function compares two dictionaries (or lists) of key phrases and returns a list of the shared key phrases.
# Each input must be either 1) in the format of the output of count_key_phrases (a dictionary), or 2) a list containing key phrases.

def shared_words(phrases_dict1,phrases_dict2):
    
    # Iterating over a dictionary yields its keys (the key phrases), so list() and set()
    # handle both accepted input types without separate type checks.

    list1 = list(phrases_dict1)
    phrases2 = set(phrases_dict2)

    # Build the shared list by walking list 1 and keeping only the entries that also appear in the second input.
    # The result is named "shared" so that it does not shadow the function name.

    shared = []
    for key_phrase in list1:
        if key_phrase in phrases2:
            shared.append(key_phrase)

    return shared
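
As an alternative to the `list.count` loop in `count_key_phrases`, which rescans the whole list once per phrase, the same tally can be sketched with `collections.Counter`. This is only an illustration: the function name `count_key_phrases_from_dict`, the `top_n` parameter, and the sample data (including its chapter-style keys) are hypothetical and not part of the notebook; the only assumption carried over is that each value in the dictionary is a list of key phrases.

```python
from collections import Counter
from itertools import chain


def count_key_phrases_from_dict(kp_dict, top_n=None):
    """Count phrase occurrences across all lists in kp_dict, most common first.

    kp_dict is assumed to map some identifier to a list of key phrases.
    top_n=None keeps every phrase; otherwise only the top_n most common are kept.
    """
    counts = Counter(chain.from_iterable(kp_dict.values()))
    return dict(counts.most_common(top_n))


# Hypothetical sample data, standing in for one loaded JSON file.
sample = {
    "1": ["Jesus", "disciples", "Galilee"],
    "2": ["Jesus", "temple"],
    "3": ["disciples", "Jesus"],
}

print(count_key_phrases_from_dict(sample))
print(count_key_phrases_from_dict(sample, top_n=2))
```

`Counter` makes a single pass over the flattened phrases, which matters more as the number of phrases per book grows.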

In [33]:
# Now, execute these functions and extract all the key phrases from the Four Gospels. 

John_key_phrase_count = count_key_phrases('NT-04_John_key_phrases.json')
Matthew_key_phrase_count = count_key_phrases('NT-01_Matthewkey_phrases.json')
Lk_key_phrase_count = count_key_phrases('NT-03_Lukekey_phrases.json')
Mk_key_phrase_count = count_key_phrases('NT-02_Mark.key_phrases.json')
print(f"There were {len(John_key_phrase_count)} key phrases found in John: {John_key_phrase_count}")
print(f"There were {len(Matthew_key_phrase_count)} key phrases found in Matthew: {Matthew_key_phrase_count}")
print(f"There were {len(Lk_key_phrase_count)} key phrases found in Luke: {Lk_key_phrase_count}")
print(f"There were {len(Mk_key_phrase_count)} key phrases found in Mark: {Mk_key_phrase_count}")

shared_gospels =  shared_words(Matthew_key_phrase_count,Mk_key_phrase_count)
shared_gospels =  shared_words(shared_gospels,Lk_key_phrase_count)
shared_gospels =  shared_words(shared_gospels,John_key_phrase_count)
print(f"I found {len(shared_gospels)} key phrases that were shared by all four gospels: {shared_gospels}")

There were 551 key phrases found in John: {'Jesus': 16, 'things': 15, 'world': 15, 'God': 14, 'Jews': 13, 'Lord': 12, 'disciples': 12, 'testimony': 11, 'Father': 11, 'name': 10, 'truth': 10, 'one': 9, 'place': 9, 'word': 9, 'glory': 8, 'Pharisees': 8, 'hour': 8, 'judgment': 8, 'Son': 8, 'man': 7, 'law': 7, 'Jerusalem': 7, 'Galilee': 7, 'will': 6, 'Moses': 6, 'Passover': 6, 'eternal life': 6, 'Amen': 6, 'works': 6, 'Christ': 6, 'feast day': 6, 'eyes': 6, 'death': 6, 'Simon Peter': 5, 'darkness': 5, 'John': 5, 'voice': 5, 'way': 5, 'water': 5, 'signs': 5, 'temple': 5, 'house': 5, 'beginning': 4, 'men': 4, 'flesh': 4, 'sin': 4, 'heaven': 4, 'Philip': 4, 'Joseph': 4, 'mother': 4, 'something': 4, 'night': 4, 'Judea': 4, 'joy': 4, 'everything': 4, 'eternity': 4, 'Prophet': 4, 'spirit': 4, 'Sabbath': 4, 'high priests': 4, 'Scripture': 4, 'heart': 4, 'next day': 3, 'Holy Spirit': 3, 'Life': 3, 'desert': 3, 'Bethania': 3, 'reason': 3, 'Israel': 3, 'Rabbi': 3, 'Andrew': 3, 'brother': 3, 'city': 
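
The three chained `shared_words` calls above amount to a set intersection over the four Gospels. A minimal sketch of that idea follows; the helper name `shared_by_all` and the small count dictionaries are hypothetical, chosen only to make the example self-contained.

```python
def shared_by_all(*phrase_counts):
    """Return the phrases present in every mapping, in the order of the first one."""
    if not phrase_counts:
        return []
    first, *rest = phrase_counts
    # Iterating over a dict yields its keys, so the intersection is over phrases.
    common = set(first).intersection(*rest)
    return [phrase for phrase in first if phrase in common]


# Hypothetical counts, not taken from the real outputs.
matthew = {"Jesus": 5, "kingdom": 4, "parable": 2}
mark = {"Jesus": 4, "kingdom": 1, "sea": 3}
luke = {"kingdom": 2, "Jesus": 6, "parable": 1}
john = {"Jesus": 16, "world": 15, "kingdom": 1}

print(shared_by_all(matthew, mark, luke, john))
```

Keeping the order of the first dictionary means the result still follows that book's frequency ranking, as the chained calls do.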

In [44]:
# In the first analysis, I show which phrases are most commonly used across all four Gospels.

# Sum each shared phrase's counts over the four Gospels.
kpe_totals_all = dict()
for key_phrase in shared_gospels:

    kpe_totals_all[key_phrase] = (
        Matthew_key_phrase_count[key_phrase]
        + Mk_key_phrase_count[key_phrase]
        + Lk_key_phrase_count[key_phrase]
        + John_key_phrase_count[key_phrase]
    )

# Then sort numerically from highest to lowest by the values.

kpe_totals_all = {k: v for k, v in sorted(kpe_totals_all.items(), key=lambda item: item[1], reverse = True)}
print(kpe_totals_all)

{'Jesus': 63, 'God': 60, 'Lord': 54, 'disciples': 50, 'things': 45, 'kingdom': 42, 'house': 42, 'heaven': 39, 'name': 37, 'word': 36, 'way': 34, 'Pharisees': 34, 'Galilee': 33, 'Jerusalem': 32, 'order': 29, 'people': 29, 'world': 29, 'scribes': 28, 'Son': 28, 'response': 27, 'John': 27, 'city': 26, 'Father': 24, 'death': 24, 'place': 24, 'mother': 23, 'man': 22, 'sea': 21, 'earth': 21, 'hour': 21, 'law': 21, 'temple': 21, 'one': 21, 'testimony': 21, 'heart': 20, 'hands': 20, 'Christ': 20, 'authority': 20, 'Moses': 20, 'eyes': 20, 'Jews': 20, 'Amen': 19, 'Judea': 19, 'glory': 19, 'Teacher': 18, 'brother': 18, 'table': 18, 'others': 18, 'truth': 18, 'priests': 17, 'Israel': 17, 'bread': 17, 'life': 17, 'everyone': 17, 'everything': 17, 'leaders': 16, 'hand': 16, 'night': 16, 'feet': 16, 'head': 15, 'woman': 15, 'Holy Spirit': 15, 'fruit': 15, 'water': 15, 'hold': 14, 'sons': 14, 'sins': 14, 'will': 14, 'David': 14, 'voice': 14, 'men': 14, 'power': 14, 'land': 13, 'body': 13, 'spirit': 13

In [47]:
# Empty out kpe_totals_df
kpe_totals_df = pd.DataFrame()

# TODO: Drop the keys from each phrase count dictionary that do not appear in the shared Gospels list.

# Then populate it with the key phrases and the counts from Matthew_key_phrase_count
kpe_totals_df = pd.DataFrame.from_dict(Matthew_key_phrase_count, orient='index', columns = ['Matthew KPE Count'])

# Append one column per remaining Gospel (Luke, Mark, John); phrases missing from a Gospel become NaN
kpe_totals_df['Luke KPE Count'] = kpe_totals_df.index.map(Lk_key_phrase_count)
kpe_totals_df['Mark KPE Count'] = kpe_totals_df.index.map(Mk_key_phrase_count)
kpe_totals_df['John KPE Count'] = kpe_totals_df.index.map(John_key_phrase_count)

# Transpose the dataframe
kpe_totals_df = kpe_totals_df.transpose()

# Print out the dimensions of the dataframe
print(kpe_totals_df.shape)
#print(kpe_totals_df.head())
    



(4, 1016)
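
The shape above has 1016 columns because the dataframe is indexed by every Matthew key phrase, not only the shared ones, which is what the TODO in the cell refers to. One possible way to restrict the table to shared phrases is sketched below. The count dictionaries are hypothetical stand-ins for the real `*_key_phrase_count` dictionaries, and the names `counts_by_book`, `shared`, and `shared_df` are illustrative only.

```python
import pandas as pd

# Hypothetical counts, not taken from the real outputs.
matthew = {"Jesus": 5, "kingdom": 4, "parable": 2}
mark = {"Jesus": 4, "kingdom": 1, "sea": 3}
luke = {"kingdom": 2, "Jesus": 6, "parable": 1}
john = {"Jesus": 16, "world": 15, "kingdom": 1}

counts_by_book = {"Matthew": matthew, "Mark": mark, "Luke": luke, "John": john}

# Keep only the phrases that occur in every book, in Matthew's order.
shared = [p for p in matthew if all(p in c for c in counts_by_book.values())]

# Build one column per book from the filtered counts, then transpose so that
# each row is a book and each column is a shared phrase.
shared_df = pd.DataFrame(
    {book: {p: counts[p] for p in shared} for book, counts in counts_by_book.items()}
).T

print(shared_df.shape)
print(shared_df.sum(axis=0))  # total count of each shared phrase across the books
```

Because every remaining phrase appears in all four books, the filtered table contains no NaN values.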


In [43]:
import plotly.express as px
df = kpe_totals_df
fig = px.bar(df, orientation='h')
fig.show()