# Prediction

This notebook reuses most of the methods from the "inference" Jupyter notebook, except that here we only care about what the prediction is, not what the "actual" answer is.<br>
If you already have the file "_Node_NLP_predictions.xlsx", skip to the [Predicting](#Predicting) section.

The code chunks below should all be run regardless of whether you use the [Set Up](#set-up) section.

In [1]:
import pandas as pd
import numpy as np
import re

# for stripping OCMs (if relevant)
def OCM_stripper(df, OCM='OCM'):
    df[OCM] = df[OCM].apply(lambda x: re.sub(" |\'",'',x))
    df[OCM] = df[OCM].apply(lambda x: x[1:-1].split(','))
    return df
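
As a quick illustration of what the OCM stripping does, here is a minimal sketch on a made-up two-row dataframe. The helper name `strip_ocm_string` and the sample values are hypothetical; the logic mirrors the two `apply` steps above (remove spaces and single quotes, drop the brackets, split on commas).

```python
import re
import pandas as pd

def strip_ocm_string(raw: str) -> list:
    """Turn a stringified list such as "['753', '761']" into ['753', '761']."""
    cleaned = re.sub(r"[ ']", "", raw)   # drop spaces and single quotes
    return cleaned[1:-1].split(",")      # drop the outer brackets, split on commas

# Hypothetical sample: OCM lists as they look after a round trip through Excel
demo = pd.DataFrame({"OCM": ["['753', '761', '902']", "['164']"]})
demo["OCM"] = demo["OCM"].apply(strip_ocm_string)
print(demo["OCM"].tolist())
```

Note that the codes stay strings (`'753'`, not `753`), and an empty list stored as `"[]"` comes back as `['']` rather than `[]`, which may matter if you later count OCMs per passage.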

In [2]:
# CHANGE Folder where your files are located
folder = '(subjects-(contracts_OR_disabilities_OR_disasters_OR_friendships_OR_gift_giving_OR_infant_feeding_OR_lineages_OR_local_officials_OR_luck_and_chance_OR_magicians_and_diviners_OR_mortuary_specialists_OR_nuclear_family_OR_priesthood_OR_prophet'
directory = f"../../../eHRAF_Scraper-Analysis-and-Prep/Data/{folder}/"

Load the transformer to get the labels. If you do not need the [Set Up](#set-up) section and just want to see prediction slices, you can enter the labels manually in the [Predicting](#Predicting) section.

In [None]:
from transformers import pipeline, AutoTokenizer

# set up the pipeline from Hugging Face (optional)
# f = open('../_HuggingFace_Auth.txt') #get the hugging face token, otherwise, enter your own in
# f = f.readlines()[0]
# classifier = pipeline("text-classification", top_k=None, model="Chantland/Hraf_MultiLabel", use_auth_token=f, tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased"))


# Set up pipeline and classifier through existent folder
# Define tokenizer kwargs
tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}

classifier_kwargs = {'top_k':None, 'device':0} #Set device -1 for CPU, 0 or higher for GPU

# CHANGE model_name name
model_name = "model_5_Roberta/Learning_Rate_2e-05_Weight_Decay_0.01_fold_1"
checkpoint_path = "checkpoint-12450"
# set up the pipeline from local
import os
path =os.path.abspath(f"{model_name}/{checkpoint_path}")
classifier = pipeline("text-classification", model=path, **classifier_kwargs)


# Sample inference: enter your text here.
text = '''
“Drinking-tubes made of the leg-bones of swans (Fig. 109) are 190 also used chiefly as a measure of precaution against diseases ‘subject to shunning.’....”
'''
# reveal sample classification
prediction = classifier(text, **tokenizer_kwargs)
prediction



In [None]:

# Get labels
labels = [x['label'] for x in prediction[0]] 
labels

['EVENT_Illness',
 'ACTION_Physical_Material',
 'CAUSE_Rule_Violation_Taboo',
 'CAUSE_Witchcraft_Sorcery',
 'CAUSE_Spirits_Gods',
 'CAUSE_Material_Physical',
 'ACTION_Priest_High_Religion',
 'ACTION_Shaman_Medium_Healer',
 'EVENT_Other',
 'ACTION_Divination',
 'EVENT_Accident',
 'ACTION_Technical_Specialist']
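
The label list is read straight off the sample prediction, so its order follows whatever order the pipeline returned for that one passage. The sketch below uses a hypothetical `mock_prediction` (invented scores, only three labels) shaped the way the code above indexes it, i.e. a list holding one list of `{'label', 'score'}` dictionaries. It shows the extraction and one way to get an order that does not depend on the sample text.

```python
# Hypothetical pipeline output for a single passage (scores are made up)
mock_prediction = [[
    {"label": "EVENT_Illness", "score": 0.97},
    {"label": "ACTION_Physical_Material", "score": 0.81},
    {"label": "CAUSE_Spirits_Gods", "score": 0.03},
]]

# Same extraction as in the cell above
labels_demo = [item["label"] for item in mock_prediction[0]]

# Label -> score lookup, handy for thresholding later
scores_demo = {item["label"]: item["score"] for item in mock_prediction[0]}

# Sorting gives a column order that is stable across different sample passages
stable_labels = sorted(labels_demo)

print(labels_demo)
print(stable_labels)
```

Either order works for prediction, since scores are looked up by label name; a stable order only matters if you compare column layouts between runs.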

## Predict

Functions and code used to do inference. Run all of these cells before using the functions.

This code will run predictions on any dataset you give it, so long as it has "passage" and "ID" columns. Expect a throughput of roughly 400 passages per minute.

In [25]:


def dataframeCreator(df, labels, passage_name, ID_name:str=False, culture_name:str=False, OCM_name:str=False, values_to_remove:list=False):
    df_small = pd.DataFrame()
    # create columns
    assert isinstance(passage_name, str), "Need to supply the column name that you will be using to predict text as a string"
    # add ID column if present
    if ID_name is not False:
        assert isinstance(ID_name, str), "Need to supply the ID column header as a string"
        df_small["ID"] = df[ID_name]

        # use a list of integers to remove specific ID passages
        if values_to_remove is not False:
            df_small = df_small[~df_small['ID'].isin(values_to_remove)]
    # add culture column if present
    if culture_name is not False:
        assert isinstance(culture_name, str), "Need to supply the culture column header as a string"
        df_small["culture"] = df[culture_name]
    # add OCM column if present
    if OCM_name is not False:
        assert isinstance(OCM_name, str), "Need to supply the OCM column header as a string"
        df_small["OCM"] = df[OCM_name]
        # Turn the string of column OCM back into a list 
        df_small = OCM_stripper(df_small)

    df_small["passage"] = df[passage_name]

    # create columns based off of the labels we will use to predict the text
    df_small[labels] = np.nan
    
    return df_small



In [30]:
def predict(df, labels, passage_name:str="passage", tokenizer_kwargs=tokenizer_kwargs):

    passage_list = df[passage_name]

    # iterate by index label (not by position) so that rows are updated in place
    # even when the dataframe index is not 0..n-1 (e.g. after filtering)
    for index, text in passage_list.items():

        prediction = classifier(text, **tokenizer_kwargs)

        # get predictions
        scores = {item['label']:item['score'] for item in prediction[0]} # turn prediction into a dictionary
        pred_labels = [1 if scores[label] >= 0.5 else 0 for label in labels]
        df.loc[index, labels] = pred_labels
    return df
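
To see the thresholding step without loading a model, here is a minimal sketch with a hypothetical `fake_classifier` that stands in for the Hugging Face pipeline and returns invented scores. It applies the same `>= 0.5` cutoff and writes results by index label, using a dataframe whose index deliberately does not start at 0.

```python
import numpy as np
import pandas as pd

demo_labels = ["EVENT_Illness", "CAUSE_Spirits_Gods"]

def fake_classifier(text, **kwargs):
    """Hypothetical stand-in for the pipeline; the scores are invented."""
    illness = 0.9 if "fever" in text else 0.1
    spirits = 0.7 if "spirit" in text else 0.2
    return [[{"label": "EVENT_Illness", "score": illness},
             {"label": "CAUSE_Spirits_Gods", "score": spirits}]]

# Index 10, 11 mimics a dataframe that was filtered before prediction
demo = pd.DataFrame({"Passage": ["a fever sent by a spirit", "they built a canoe"]},
                    index=[10, 11])
demo[demo_labels] = np.nan  # preallocate the label columns

for idx, text in demo["Passage"].items():
    scores = {d["label"]: d["score"] for d in fake_classifier(text)[0]}
    # a label is predicted (1) when its score reaches the 0.5 threshold
    demo.loc[idx, demo_labels] = [1 if scores[l] >= 0.5 else 0 for l in demo_labels]

print(demo)
```

Because every label is thresholded independently, a passage can receive several labels or none at all.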

### Initial Dataframe SetUp

Extract passages

In [22]:

# Old code for having a small subset of the columns, but I decided it may just be better to include all the fluff
# # Extract passages

# # CHANGE All of these depending on your dataframe location and its column names
# df = pd.read_excel(directory+"_Altogether_Dataset_CLEANED.xlsx")
# passage_name = "Passage"
# ID_name = "Passage Number" #OPTIONAL but recommended for MISF datasets
# culture_name = "Culture" #OPTIONAL but recommended for MISF datasets
# OCM_name = "OCM" #OPTIONAL but recommended for MISF datasets

# # fit up the dataframe so it can be more easily used for making predictions
# df_small = dataframeCreator(df=df, labels=labels, passage_name=passage_name, ID_name=ID_name, culture_name=culture_name, OCM_name=OCM_name)

# # # here is the basic version if you don't want ID's or cultures
# # df_small = dataframeCreator(df=df, labels=labels, passage_name=passage_name)

# df_small.head(4)


# Extract passages (include all the other column fluff)
df_small = pd.read_excel(directory+"_Altogether_Dataset_CLEANED_ALLSCRAPED.xlsx")
df_small[labels] = np.nan #preallocate space
df_small = OCM_stripper(df_small)  
df_small = df_small.rename(columns={"Passage Number":"ID"}) # Standardize "Passage Number" to "ID" to match the machine learning labels (update the old name here if the source column is ever renamed)

# Add a "Model Info" column recording which model and checkpoint produced these predictions

df_small["Model Info"] = None
df_small.loc[0, "Model Info"] = f"Model: {model_name}"
df_small.loc[1, "Model Info"] = f"Checkpoint: {checkpoint_path}"

df_small

Unnamed: 0,ID,Region,SubRegion,Culture,DocTitle,Section,Author,Page,Year,OCM,...,CAUSE_Witchcraft_Sorcery,CAUSE_Spirits_Gods,CAUSE_Material_Physical,ACTION_Priest_High_Religion,ACTION_Shaman_Medium_Healer,EVENT_Other,ACTION_Divination,EVENT_Accident,ACTION_Technical_Specialist,Model Info
0,1,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",57,1993,"[753, 761, 902]",...,,,,,,,,,,Model: model_5_Roberta/Learning_Rate_2e-05_Wei...
1,2,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",59,1993,"[164, 752, 902]",...,,,,,,,,,,Checkpoint: checkpoint-12450
2,3,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,,,,,,,,,,
3,4,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,,,,,,,,,,
4,5,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",61,1993,"[164, 752, 902]",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109017,132676,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,OTHER STARS,"Chamberlain, Von Del",134,1982,"[793, 805, 821]",...,,,,,,,,,,
109018,132679,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,4 STAGING: THE SKIDI OBSERVATIONAL SYSTEM,"Chamberlain, Von Del",175,1982,"[121, 342, 353, 793]",...,,,,,,,,,,
109019,132680,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,4 STAGING: THE SKIDI OBSERVATIONAL SYSTEM,"Chamberlain, Von Del",179,1982,"[121, 342, 793]",...,,,,,,,,,,
109020,132682,North-America,Plains and Plateau,Pawnee,The chief and his council: unity and authority...,The United Stars,"Chamberlain, Von Del",229,1992,"[793, 805, 821]",...,,,,,,,,,,


Optionally add the dataset indicators if they exist (otherwise disregard this cell)

In [28]:
df_small.columns

Index(['ID', 'Region', 'SubRegion', 'Culture', 'DocTitle', 'Section', 'Author',
       'Page', 'Year', 'OCM', 'OWC', 'Passage', 'EVENT_Illness',
       'ACTION_Physical_Material', 'CAUSE_Rule_Violation_Taboo',
       'CAUSE_Witchcraft_Sorcery', 'CAUSE_Spirits_Gods',
       'CAUSE_Material_Physical', 'ACTION_Priest_High_Religion',
       'ACTION_Shaman_Medium_Healer', 'EVENT_Other', 'ACTION_Divination',
       'EVENT_Accident', 'ACTION_Technical_Specialist', 'Model Info'],
      dtype='object')

In [None]:

df_dataset = pd.read_excel(directory+"_Dataset_Lists.xlsx")
df_dataset = df_dataset.rename(columns={'Passage Number':'ID'}) #rename
df_dataset = df_dataset[["ID","Dataset"]] # only use the following columns

df_small = df_small.merge(df_dataset, on='ID', how='left')
print(df_small["Dataset"].value_counts(dropna=False))
print(f"Total: {len(df_small)}")

Dataset
2    16210
1     6112
Name: count, dtype: int64
Total: 22322
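
The merge above is a left join on `ID`, so every passage is kept and passages without a dataset assignment get `NaN` in the `Dataset` column. A minimal sketch with made-up frames (the names `passages` and `dataset_lists` and all values are hypothetical):

```python
import pandas as pd

passages = pd.DataFrame({"ID": [1, 2, 3], "Passage": ["a", "b", "c"]})
dataset_lists = pd.DataFrame({"ID": [1, 2], "Dataset": [1, 2]})  # ID 3 has no dataset

# validate="many_to_one" raises if an ID appears more than once in dataset_lists,
# which would otherwise silently duplicate passages
merged = passages.merge(dataset_lists, on="ID", how="left", validate="many_to_one")

# dropna=False makes the unassigned passages visible in the counts
print(merged["Dataset"].value_counts(dropna=False))
print(f"Total: {len(merged)}")
```

Comparing the printed total with the row count before the merge is a quick check that no passages were duplicated or lost.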


### Sample run

In [31]:
# Sample run on the first 200 passages
df_shaved = df_small.iloc[0:200].copy()
df_shaved = predict(df_shaved, labels, passage_name="Passage", tokenizer_kwargs=tokenizer_kwargs)
df_shaved.head(4)

Unnamed: 0,ID,Region,SubRegion,Culture,DocTitle,Section,Author,Page,Year,OCM,...,CAUSE_Witchcraft_Sorcery,CAUSE_Spirits_Gods,CAUSE_Material_Physical,ACTION_Priest_High_Religion,ACTION_Shaman_Medium_Healer,EVENT_Other,ACTION_Divination,EVENT_Accident,ACTION_Technical_Specialist,Model Info
0,1,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",57,1993,"[753, 761, 902]",...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Model: model_5_Roberta/Learning_Rate_2e-05_Wei...
1,2,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",59,1993,"[164, 752, 902]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,Checkpoint: checkpoint-12450
2,3,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,
3,4,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,



### Set Up coding for dataset

In [33]:
# Throughput ranges from 60 to 400 passages a second, but results may vary
df_coded = df_small.copy()
df_coded = predict(df_coded, labels, passage_name="Passage", tokenizer_kwargs=tokenizer_kwargs)
df_coded.to_excel(directory+"_Node_NLP_predictions.xlsx", index=False)
df_coded

Unnamed: 0,ID,Region,SubRegion,Culture,DocTitle,Section,Author,Page,Year,OCM,...,CAUSE_Witchcraft_Sorcery,CAUSE_Spirits_Gods,CAUSE_Material_Physical,ACTION_Priest_High_Religion,ACTION_Shaman_Medium_Healer,EVENT_Other,ACTION_Divination,EVENT_Accident,ACTION_Technical_Specialist,Model Info
0,1,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",57,1993,"[753, 761, 902]",...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Model: model_5_Roberta/Learning_Rate_2e-05_Wei...
1,2,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",59,1993,"[164, 752, 902]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,Checkpoint: checkpoint-12450
2,3,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,
3,4,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",60,1993,"[164, 752, 902]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,
4,5,Africa,Northern Africa,Libyan Bedouin,Writing women's worlds: Bedouin stories,Losing Men,"Abu-Lughod, Lila",61,1993,"[164, 752, 902]",...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109017,132676,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,OTHER STARS,"Chamberlain, Von Del",134,1982,"[793, 805, 821]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
109018,132679,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,4 STAGING: THE SKIDI OBSERVATIONAL SYSTEM,"Chamberlain, Von Del",175,1982,"[121, 342, 353, 793]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
109019,132680,North-America,Plains and Plateau,Pawnee,When stars came down to earth: cosmology of th...,4 STAGING: THE SKIDI OBSERVATIONAL SYSTEM,"Chamberlain, Von Del",179,1982,"[121, 342, 793]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
109020,132682,North-America,Plains and Plateau,Pawnee,The chief and his council: unity and authority...,The United Stars,"Chamberlain, Von Del",229,1992,"[793, 805, 821]",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


## Exploration (Optional)


Each of these code chunks looks at a different slice of the data. The only required chunk is the one directly below.

In [12]:
# load dataset
df = pd.read_excel(directory+"_Node_NLP_predictions.xlsx")
if 'OCM' in df.columns:
    df = OCM_stripper(df)

def label_percentage(df,labels, grouping:str=False, OCM_grouping:list=False):
    df_perc = pd.DataFrame(columns=["grouping","count"])
    df_perc[labels] = np.nan

    # get total percentage per label
    df_perc.loc[0, "grouping"] = "TOTAL"
    df_perc.loc[0, "count"] = len(df)
    df_perc.loc[0, labels] = [df[label].mean() for label in labels]

    # get percentage by group per label
    if grouping is not False:
        assert isinstance(grouping, str), "Need to supply a string for the column group" # fail fast with a clear message if the argument has the wrong type
        assert OCM_grouping is False, "Cannot do OCM grouping and normal grouping!"
        for index, group in enumerate(df[grouping].unique()):
            df_perc.loc[index+1, "grouping"] = group
            df_perc.loc[index+1, "count"] = len(df.loc[df[grouping]== group])
            df_perc.loc[index+1, labels] = [df.loc[df[grouping]== group][label].mean() for label in labels]
    # Grouping by OCMs. Each passage holds a list of OCMs, so they cannot be grouped like an ordinary column
    if OCM_grouping is not False: 
        assert isinstance(OCM_grouping, list), "OCMs must be in a list"
        print("Note, TOTAL will not add up as there is overlap between OCMs")
        for index, OCM in enumerate(OCM_grouping):
            df_perc.loc[index+1, "grouping"] = OCM
            msk = df['OCM'].apply(lambda x: not set(x).isdisjoint([OCM]))
            df_perc.loc[index+1, "count"] = len(df.loc[msk])
            df_perc.loc[index+1, labels] = [df.loc[msk][label].mean() for label in labels]
    return df_perc
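
For the plain column grouping, the per-group proportions are simply the mean of each 0/1 label column within the group. The sketch below shows the same idea with `groupby` on a made-up dataframe; it is an alternative formulation for checking results, not a replacement for `label_percentage`, and all names and values are hypothetical.

```python
import pandas as pd

demo_labels = ["EVENT_Illness", "EVENT_Accident"]
demo = pd.DataFrame({
    "Dataset": [1, 1, 2, 2],
    "EVENT_Illness": [1, 0, 1, 1],
    "EVENT_Accident": [0, 0, 1, 0],
})

# mean of a 0/1 column = share of passages in the group carrying that label
by_group = demo.groupby("Dataset")[demo_labels].mean()

# add the number of passages per group as the first column
by_group.insert(0, "count", demo.groupby("Dataset").size())

print(by_group)
```

Unlike the loop above, `groupby` sorts the groups and drops rows whose grouping value is missing, so the row order and counts can differ if the grouping column contains `NaN`.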

### Complete datafile prediction

In [13]:
df_perc = label_percentage(df,labels)
df_perc

Unnamed: 0,grouping,count,ACTION,EVENT,CAUSE
0,TOTAL,66556,0.647455,0.728499,0.46092


### By Culture Prediction

In [14]:
df_perc = label_percentage(df,labels, grouping="culture")
df_perc

Unnamed: 0,grouping,count,ACTION,EVENT,CAUSE
0,TOTAL,66556,0.647455,0.728499,0.460920
1,Libyan Bedouin,1068,0.675094,0.785581,0.462547
2,Shluh,695,0.545324,0.604317,0.326619
3,Kurds,249,0.759036,0.867470,0.493976
4,Kanuri,638,0.650470,0.758621,0.454545
...,...,...,...,...,...
56,Tiv,1933,0.618210,0.783756,0.509053
57,Ojibwa,2733,0.679473,0.802049,0.538968
58,Akan,2192,0.635949,0.812500,0.557482
59,Ifugao,2798,0.442816,0.534310,0.303788


### By dataset

In [15]:
df_perc = label_percentage(df,labels, grouping="Dataset")
df_perc

Unnamed: 0,grouping,count,ACTION,EVENT,CAUSE
0,TOTAL,66556,0.647455,0.728499,0.46092
1,1,6130,0.691517,0.882382,0.710767
2,4,21364,0.660082,0.674125,0.315344
3,2,4809,0.700977,0.876897,0.688501
4,3,34253,0.624179,0.71404,0.475053


### Remove 788 OCM then check by Dataset

In [16]:
df_788removed = df.copy()

msk = (df_788removed['OCM'].apply(lambda x: set(x).isdisjoint(['788'])) | (df_788removed['Dataset'] == 1))
df_788removed = df_788removed.loc[msk].copy()
df_788removed

Unnamed: 0,ID,culture,OCM,passage,ACTION,EVENT,CAUSE,Dataset
0,1,Libyan Bedouin,"[753, 761, 902]","“When Jawwad went in to see him, he shuddered....",0,1,1,1
1,2,Libyan Bedouin,"[164, 752, 902]","He had stepped on a mine. “Watch out, watch ou...",1,1,0,1
2,3,Libyan Bedouin,"[164, 752, 902]","“We used to go out and collect copper,” he beg...",1,1,0,1
3,4,Libyan Bedouin,"[164, 752, 902]",“I was walking along when I found that my shoe...,0,1,0,1
4,5,Libyan Bedouin,"[164, 752, 902]",“It was hissing and there was smoke and after ...,0,1,1,1
...,...,...,...,...,...,...,...,...
66551,132676,Pawnee,"[793, 805, 821]","Next we consider the Seven Stars, the Pleiades...",1,1,0,3
66552,132679,Pawnee,"[121, 342, 353, 793]",The Pleiades offer an interesting illustration...,0,0,0,3
66553,132680,Pawnee,"[121, 342, 793]",The observatory features of the Pawnee house m...,1,0,0,3
66554,132682,Pawnee,"[793, 805, 821]","The Skidi associated another group of stars, w...",1,1,0,3
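
The mask keeps a passage if its OCM list does not contain `'788'`, or if it belongs to Dataset 1. A minimal sketch on three made-up rows (all values hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "OCM": [["753", "788"], ["164", "752"], ["788"]],
    "Dataset": [1, 2, 3],
})

# True when the passage's OCM list shares no element with ['788']
has_no_788 = demo["OCM"].apply(lambda codes: set(codes).isdisjoint(["788"]))

# Dataset 1 passages are kept even when they carry OCM 788
keep = has_no_788 | (demo["Dataset"] == 1)

print(demo.loc[keep, "ID"].tolist())
```

The comparison is against the string `'788'`, so it relies on the OCM lists holding strings, as they do after `OCM_stripper`.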


In [17]:
df_perc = label_percentage(df_788removed,labels, grouping="Dataset")
df_perc

Unnamed: 0,grouping,count,ACTION,EVENT,CAUSE
0,TOTAL,65957,0.645633,0.727838,0.461619
1,1,6130,0.691517,0.882382,0.710767
2,4,21291,0.659387,0.673712,0.315579
3,2,4711,0.697304,0.877733,0.693483
4,3,33825,0.621463,0.713023,0.476098


### Prediction by each OCM


In [18]:
OCMs = ["750", "751", "752", "753", "780", "781", "784", "785", "586", "684", "688", "731", "732", "756", "767", "777", "791", "792", "793", "431", "572", "594", "613", "624", "675", "853"]
# OCMs = ["750", "751", "752", "753", "780", "781", "784", "785", "788"]
df_perc = label_percentage(df,labels, grouping=False, OCM_grouping=OCMs)
df_perc

Note, TOTAL will not add up as there is overlap between OCMs


Unnamed: 0,grouping,count,ACTION,EVENT,CAUSE
0,TOTAL,66556,0.647455,0.728499,0.46092
1,750,406,0.802956,0.881773,0.534483
2,751,1298,0.838983,0.812789,0.553159
3,752,1095,0.736986,0.815525,0.477626
4,753,3460,0.615896,0.933526,0.869075
5,780,169,0.686391,0.704142,0.431953
6,781,387,0.651163,0.847545,0.622739
7,784,3814,0.679077,0.886471,0.736497
8,785,765,0.839216,0.90719,0.614379
9,586,3295,0.724127,0.87648,0.591806
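
The note about TOTAL not adding up follows from passages carrying several OCMs at once: each such passage is counted under every OCM it lists. A minimal sketch of this overlap, using `DataFrame.explode` on made-up data (an alternative way to get per-OCM counts, not the method used in `label_percentage`):

```python
import pandas as pd

demo = pd.DataFrame({
    "OCM": [["750", "751"], ["751"], ["752"]],
    "EVENT_Illness": [1, 0, 1],
})

# one row per (passage, OCM) pair
exploded = demo.explode("OCM")

# passages per OCM and share of them labelled EVENT_Illness
per_ocm = exploded.groupby("OCM")["EVENT_Illness"].agg(["count", "mean"])

print(per_ocm)
print(per_ocm["count"].sum(), "rows after exploding vs", len(demo), "passages")
```

One difference to keep in mind: if a passage listed the same OCM twice, `explode` would count it twice, whereas the set-based mask in `label_percentage` counts it once.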
