# Stats for eHRAF Scraper
The current stats are merely to 
- Clean and reorganize the dataframe 
- Find the most common OCM codes
- Find association rules (when one OCM appears this other OCM is likely to appear)
More work will be done


## Clean the Dataframe


In [338]:
import pandas as pd                 # dataframe storing
import re                           # regex for searching through strings

# put file name here
# file = r'subjects-(bodily_injuries_OR_preventive_medicine)_FILTERS-culture_level_samples(PSF)_web_data.xlsx'
file = r'subjects~(%22sickness%22)|PSF_web_data.xlsx'
df = pd.read_excel('../Data/' + file)
# Turn the string of column OCM back into a list 
df['OCM'] = df.OCM.apply(lambda x: re.sub(" ",'',x))
df['OCM'] = df.OCM.apply(lambda x: x[1:-1].split(','))

# did it work? did it output a single OCM string?
df['OCM'][0][0]

'428'

In [339]:
# drop all rows who have a blank passage
print(f'Before: {len(df)}')
df = df.dropna(subset="Passage")
print(f'After: {len(df)}')

Before: 42555
After: 42458


In [340]:
# If you use a higher order code (750) eHRAF attempts to aquire ALL OCMs related to your input.
# select only the OCMs we originally wished to search for (750-754)
lst = ["750","751","752","753","754"]
# lst = ["751", "752"]
msk = df['OCM'].apply(lambda x: not set(x).isdisjoint(lst))
df = df.loc[msk]
len(df)

14866

In [341]:
# Find the number of passages for each culture BEFORE we reorganize the dataframe. This is later used for OCM code counting as the total
culture_set = set(df["Culture"])
culture_dict = {}
it_count = 0
for cult_i in culture_set:
    row_count = len(df.loc[df["Culture"]==cult_i])
    culture_dict[cult_i] = row_count
    it_count += row_count
print(f'Passages: {it_count}')


Passages: 14866


### Shave OCMs
And make an exploded OCM dataframe

In [342]:
# Make a dataset in which each OCM have its own row by exploding (you can reset the index with .reset_index(drop=True))
df_OCM = df.explode(column='OCM').reset_index(drop=True)
# Find OCM's that do not fit the normal 100-900 OCM scheme
# NOTE 0 means the material is not relevant, I am unsure, however, why this sometimes appears with other OCM's in the same passage
# NOTE I believe 5310 and 5311 are different specifications of 531 while 1710 might be a more specific (and singlular) subset of 171? I do not believe the same for 77 and 1787
list_OCM = df_OCM['OCM'].value_counts().index.tolist()
small_OCM = [x for x in list_OCM if len(x) <3 or len(x) > 3]
small_OCM

['0', '5311', '5310']

In [343]:
# remove and shave OCM codes
# add to the list for codes which should be removed
remove_list = ['1787','77']
for i in remove_list:
    df_OCM = df_OCM[df_OCM["OCM"].str.contains(i) == False]
# "Shave" the OCM codes that seem to have a parent (5310 and 5311 become 531).
df_OCM['OCM'] = df_OCM.OCM.apply(lambda x: x[0:3] if len(x) >= 3 else x)
len(df_OCM)

48299

In [311]:
df_demo = df_OCM.copy()
df_demo = df_demo.groupby("Passage").agg({'OCM': lambda x: x.tolist()}).reset_index()
df_demo

Unnamed: 0,Passage,OCM
0,#8,"[256, 291, 354, 407, 415, 417, 751]"
1,'M-bwat. | The extraction of teeth. N[unknown]...,"[113, 174, 237, 304, 626, 752]"
2,( a ) Love charms. — (1) The most common form ...,"[224, 226, 231, 278, 302, 415, 524, 531, 726, ..."
3,() Kurama are one of the tribes of eastern Zar...,[751]
4,(1) Seven needles that represent the seven “de...,[751]
...,...,...
2489,“Yeah.”,"[752, 876, 900]"
2490,“You okay? Rope strong enough?”,"[358, 752, 900]"
2491,"“You'd better go see Brother Laflamme,” Father...","[481, 517, 752, 900]"
2492,“[JMR: from Lieutenant Nordquist's report] ‘Pi...,"[264, 515, 751]"


```df_demo = df.copy():``` This creates a copy of the DataFrame df and assigns it to a new variable df_demo. This is done to avoid modifying the original DataFrame.

```dup1 = df_demo["Passage"].duplicated(keep=False):``` This creates a Boolean mask dup1 indicating whether each element in the "Passage" column of the DataFrame df_demo is duplicated or not. The keep=False argument tells the function to mark all duplicates as True in the resulting Boolean mask.

```dup3 = df_demo[dup1].duplicated(subset=["Passage", "DocTitle"],``` keep=False): This creates a Boolean mask dup3 indicating whether each row of the DataFrame df_demo is duplicated or not based on the subset of columns specified, in this case the "Passage" and "DocTitle" columns. The keep=False argument tells the function to mark all duplicates as True in the resulting Boolean mask.

```df_demo.drop(df_demo[dup1][~dup3].index, inplace=True):``` This drops all the duplicate rows found using dup1 and dup3 from the dataframe df_demo and updates the original dataframe. The ~dup3 is used to select the all the rows which are not present in dup3, so it will drop only the duplicates based on the column "Passage" and "DocTitle"

In summary, this code removes all duplicates based on the column "Passage" and "DocTitle" from the DataFrame df and assigns the cleaned DataFrame to df_demo.

In [418]:
# Find all passages which are duplicates. First let's explore some of the duplicates
df_demo = df.copy()
dup1 = df_demo["Passage"].duplicated(keep=False)
dup2 = df_demo[dup1].duplicated(subset=["Passage", "DocTitle"], keep=False)
# rows which contain duplicate passages but not part of the same document (only top 4 shown)
df_demo[dup1][~dup2].sort_values(by='Passage').head(4)

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage,run_Info
12016,Asia,South Asia,Santal,"The hill of flutes: life, love, and poetry in ...",1974,"[754, 755, 782]",aw42,"If a bonga has been buried by a witch, the fir...",
11724,Asia,South Asia,Santal,Tribal law and justice: a report on the Santal,1984,"[754, 755, 782]",aw42,"If a bonga has been buried by a witch, the fir...",
11660,Asia,South Asia,Santal,Tribal law and justice: a report on the Santal,1984,[754],aw42,If no member of the inner family or household ...,
11987,Asia,South Asia,Santal,"The hill of flutes: life, love, and poetry in ...",1974,[754],aw42,If no member of the inner family or household ...,


In [425]:
df_demo2 = df.copy()
df_demo2["OCM"] = df_demo2['OCM'].apply(tuple)

dup3 = df_demo2[dup1].duplicated(subset=["Passage", "OCM"], keep=False)
df_demo2[dup1][~dup3].sort_values(by="Passage")

59

In [407]:
# Drop all duplicate passages (keep at least one)
dup4 = df_demo[dup1].duplicated(subset=["Passage", "DocTitle"], keep='first')
df_demo.drop(df_demo[dup4].index, inplace=True)
len(df_demo)

### Optional Exploration

In [294]:
# (OPTIONAL)
# Quick search for OCMs
# NOTE, sometimes a higher order code like 750 appears without lower order codes)
lst = ["751","752"] #enter your OCM strings here separated by a comma
msk = df['OCM'].apply(lambda x: not set(x).isdisjoint(lst))
out = df.loc[msk]
out

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage,run_Info
4,Africa,Central Africa,Azande,Culture summary: Azande,1999,"[751, 753, 757, 789]",fo07,Zande apply generally known common-sense cures...,Run URL: https://ehrafworldcultures.yale.edu/s...
21,Africa,Central Africa,Azande,An account of the Zande,1926,"[751, 846]",fo07,“A newly-born child is decorated with necklace...,
22,Africa,Central Africa,Azande,An account of the Zande,1926,"[751, 846]",fo07,It is said that a baby is a gift from the spir...,
24,Africa,Central Africa,Azande,An account of the Zande,1926,"[751, 821]",fo07,“Water seems to extinguish the mystical virtue...,
30,Africa,Central Africa,Azande,An account of the Zande,1926,[752],fo07,“If a man be wounded by a splinter of wood or ...,
...,...,...,...,...,...,...,...,...,...
42503,South-America,Southern South America,Ona,"The Fireland Indians: Vol. 1. The Selk'nam, on...",1931,"[532, 752, 787, 869, 881]",sh04,The men now remain firmly convinced that the f...,
42506,South-America,Southern South America,Ona,The Ona,1946,"[302, 751, 822]",sh04,Smearing the head and body with grease served ...,
42513,South-America,Southern South America,Ona,The Ona,1946,"[752, 757]",sh04,Herbal curatives were lacking. Massage was a c...,
42516,South-America,Southern South America,Ona,Drama and power in a hunting society: the Selk...,1982,"[113, 302, 515, 751]",sh04,"In normal times, when no special event was tak...",


In [295]:
# (OPTIONAL)
# There are some passages that describe previous passages but do not contain information themselves like: 
# "Notes" or "End" or "Log"
# This code cell indicates (but does not remove) how many passages are short like the ones described which 
# may disrupt our OCM stats because they contain OCMs without actually having text that refers to these OCMs
shortPass_list = []
for i in df['Passage']:
    if len(i)<=10:
        shortPass_list.append(i)
print(f'Number of passages with text with 10 of fewer characters: {len(shortPass_list)}')

Number of passages with text with 10 of fewer characters: 6


## OCM Code Counting
Count every OCM within each culture. Do not count OCM's specified by the search (like if searched for 750-755, do not count these). 
<!-- - REMOVE all passages which are blank since we can't very well do lexical searches on them -->

In [296]:
# Make a copy of df_OCM as to not interfere with other analysis
df_OCM_freq = df_OCM.copy()
# Then turn the OCM's back to an integer (for removals)
df_OCM_freq['OCM'] = df_OCM_freq.OCM.apply(lambda x: int(x))
# only keep OCMs outside our search parameters whatever those are
df_sub_ex = df_OCM_freq.loc[(df_OCM_freq["OCM"]<750) | (df_OCM_freq["OCM"]>754)]

# Overwrite and create a new dataframe for OCM counts and frequencies
df_OCM_freq = pd.DataFrame(columns=["Culture","OCM","Frequency","Proportion_of_Passages"])
for key, val in culture_dict.items():
    value_count = df_sub_ex.loc[df_sub_ex["Culture"]==key]["OCM"].value_counts()
    # duplicate the culture word and asign it to each of its rows
    cult_count = [key] * len(value_count)
    # create a culture dataframe and append it to to the 
    df_OCM_Concat = pd.DataFrame({"Culture":cult_count,"OCM":value_count.index, "Frequency":value_count.values, "Proportion_of_Passages":value_count.values/val})
    df_OCM_freq = pd.concat([df_OCM_freq, df_OCM_Concat], ignore_index=True)
df_OCM_freq

Unnamed: 0,Culture,OCM,Frequency,Proportion_of_Passages
0,Highland Scots,758,4,0.571429
1,Highland Scots,871,3,0.428571
2,Highland Scots,164,2,0.285714
3,Highland Scots,514,2,0.285714
4,Highland Scots,175,2,0.285714
...,...,...,...,...
2444,Pawnee,725,1,0.032258
2445,Pawnee,701,1,0.032258
2446,Pawnee,622,1,0.032258
2447,Pawnee,797,1,0.032258


In [297]:
df_OCM_freq["Proportion_of_Passages"].sum()/len(set(df_OCM_freq["Culture"]))

3.1115613913020703

In [236]:
# Save the file
df_OCM_freq.to_csv("Culture_Frequency.csv", index=False)

## Association Rules for OCMs
Using Machine Learning, we will attempt to determine the co-occurance of OCMs. For example, if the OCM code 262 is present, what is the likelihood that both 751 and 752 would be present?

In [237]:
# Make a dataset in which each OCM have its own row by exploding (you can reset the index with .reset_index(drop=True))
df_OCM = df.explode(column='OCM').reset_index(drop=True)
df_OCM

Unnamed: 0,Region,SubRegion,Culture,DocTitle,Year,OCM,OWC,Passage,run_Info
0,Africa,Central Africa,Azande,Culture summary: Azande,1999,428,fo07,"The property of commoners, their wives, and an...",User: No Name Specified
1,Africa,Central Africa,Azande,Culture summary: Azande,1999,754,fo07,"The property of commoners, their wives, and an...",User: No Name Specified
2,Africa,Central Africa,Azande,Culture summary: Azande,1999,626,fo07,Day-to-day behavior is largely governed by the...,Run Time: 11:43:43
3,Africa,Central Africa,Azande,Culture summary: Azande,1999,681,fo07,Day-to-day behavior is largely governed by the...,Run Time: 11:43:43
4,Africa,Central Africa,Azande,Culture summary: Azande,1999,684,fo07,Day-to-day behavior is largely governed by the...,Run Time: 11:43:43
...,...,...,...,...,...,...,...,...,...
53412,South-America,Southern South America,Ona,Drama and power in a hunting society: the Selk...,1982,756,sh04,"The shamans, called xo'on, had great prestige ...",
53413,South-America,Southern South America,Ona,Drama and power in a hunting society: the Selk...,1982,787,sh04,"The shamans, called xo'on, had great prestige ...",
53414,South-America,Southern South America,Ona,Drama and power in a hunting society: the Selk...,1982,682,sh04,"It should be noted, of course, that in this cu...",
53415,South-America,Southern South America,Ona,Drama and power in a hunting society: the Selk...,1982,754,sh04,"It should be noted, of course, that in this cu...",


In [238]:
# Load resources
from mlxtend.preprocessing import TransactionEncoder

# We will use the apriori module to generate a dataframe that
# we can use for association rule finding
from mlxtend.frequent_patterns import apriori

# We will use the association_rules module to generate
# our association rules from the apriori output data frame
from mlxtend.frequent_patterns import association_rules





In [239]:
#Display important columns
df_smaller = df_OCM[['Culture', 'OCM','Passage']]
df_smaller

Unnamed: 0,Culture,OCM,Passage
0,Azande,428,"The property of commoners, their wives, and an..."
1,Azande,754,"The property of commoners, their wives, and an..."
2,Azande,626,Day-to-day behavior is largely governed by the...
3,Azande,681,Day-to-day behavior is largely governed by the...
4,Azande,684,Day-to-day behavior is largely governed by the...
...,...,...,...
53412,Ona,756,"The shamans, called xo'on, had great prestige ..."
53413,Ona,787,"The shamans, called xo'on, had great prestige ..."
53414,Ona,682,"It should be noted, of course, that in this cu..."
53415,Ona,754,"It should be noted, of course, that in this cu..."


In [240]:
df_group = df_smaller.groupby(by = ['Culture', 'Passage'])
df_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc796a16b90>

In [241]:
def make_OCM_list(x):

    '''
    Will return a list of the unique items
    in a particular grouping when used with
    the agg method as its function
    '''

    return x.unique()

In [242]:
# Use the agg method and make_OCM_list
# to return a list of unique items for each ocm
df_unique = df_group.agg(make_OCM_list)

In [243]:
list_trans = list(df_unique['OCM'])
list_trans = list_trans[0:]
len(list_trans)

14642

In [244]:
te = TransactionEncoder()
encoded_itemset = te.fit(list_trans).transform(list_trans)
print(encoded_itemset.shape) # show possible transcations and number of items
te.columns_



df_encoded = pd.DataFrame(encoded_itemset, columns = te.columns_)
df_encoded.head()

(14642, 557)


Unnamed: 0,0,101,102,103,104,105,112,113,114,115,...,885,886,887,888,890,900,901,902,903,984
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [245]:
# Before we begin, let's do a small
# amount of cleanup.  Let's remove all
# columns (items) that have less than 1 characters since that is just blank space
# more data cleaning my be required as time continues in case errors become evident in the scraped dataset
OCM_items = list(filter(lambda x: len(x) < 1, te.columns_ ))
print("removed: ",  OCM_items)
df_encoded = df_encoded.drop(columns=OCM_items) #remove small strings as they seem not to be items
print('How many unique items are left?', len(df_encoded.columns))

removed:  []
How many unique items are left? 557


In [246]:
# Use apriori to create a dataframe with columns of support and itemset lists
# Note that if your items are large compared to your sample (you have few rows but many columns) I reccommend using 
# a higher min_support as many more combinations may have spuriously higher support. Also, you can crash the program if too many are selected
df_support = apriori(df_encoded, min_support=0.01, use_colnames=True)
df_support.sort_values('support', inplace=True, ascending = False)
df_support

Unnamed: 0,support,itemsets
28,0.588239,(754)
27,0.247371,(753)
40,0.131335,(776)
29,0.111050,(755)
3,0.107567,(159)
...,...,...
185,0.010313,"(753, 782, 776)"
108,0.010176,"(753, 825)"
54,0.010176,(827)
8,0.010176,(182)


## Use association_rules to find the rules

Using the dataframe generated by `apriori`, find the association rules with the greatest lift.  See the [association_rules documentation](https://rasbt.github.io/mlxtend/api_modules/mlxtend.frequent_patterns/association_rules/) for how to do this.

Sort the resulting DataFrame by lift in descending order.  A lift > 1 indicates that the items are often purchased together and that buying X will increase the purchase of Y.  A lift of < 1 indicates the items are often substituted.  That is X is substituted for Y so X and Y don't appear together often.

Examine the resulting DataFrame.  For the association rule X -> Y, X is the column `antecedents` and Y is the column `consequents`.  If sorted you can see the metrics for each rule based upon the lift.

In [247]:
# Find the association rules
rules = association_rules(df_support, metric = 'lift', min_threshold=1.0)
# lift >1 more likely than chance X means you see Y
# lift = 1 as often as chance
# lift <1 (substitution) less likely than chance X means you see Y


In [248]:
# Sort the rules by lift
# and examine the output
# to find what rules were
# discovered
rules.sort_values('lift', ascending=False, inplace =True)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
5420,"(793, 755, 787, 775)","(539, 782, 776, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
5443,"(539, 782, 752)","(755, 776, 787, 775, 793)",0.013181,0.012362,0.010518,0.797927,64.548364,0.010355,4.887543
4145,"(793, 755, 787, 775)","(539, 782, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
5358,"(755, 776, 787, 775, 793)","(539, 782, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
5381,"(539, 782, 776, 752)","(793, 755, 787, 775)",0.013181,0.012362,0.010518,0.797927,64.548364,0.010355,4.887543
...,...,...,...,...,...,...,...,...,...
3699,(754),(828),0.588239,0.019738,0.011952,0.020318,1.029405,0.000341,1.000592
25,(754),(778),0.588239,0.046783,0.027933,0.047486,1.015030,0.000414,1.000738
24,(778),(754),0.046783,0.588239,0.027933,0.597080,1.015030,0.000414,1.021942
6206,(826),(754),0.017416,0.588239,0.010313,0.592157,1.006660,0.000068,1.009606


In [252]:
# look for OCM codes within the list
lst = frozenset(["793","226"])
msk = rules['antecedents'].apply(lambda x: not set(x).isdisjoint(lst))
out = rules.loc[msk]
out

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
5420,"(793, 755, 787, 775)","(539, 782, 776, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
4145,"(793, 755, 787, 775)","(539, 782, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
5358,"(755, 776, 787, 775, 793)","(539, 782, 752)",0.012362,0.013181,0.010518,0.850829,64.548364,0.010355,6.615340
5354,"(782, 755, 787, 775, 793)","(539, 776, 752)",0.012362,0.013250,0.010518,0.850829,64.215640,0.010354,6.614883
3831,"(793, 755, 787, 775)","(539, 776, 752)",0.012362,0.013250,0.010518,0.850829,64.215640,0.010354,6.614883
...,...,...,...,...,...,...,...,...,...
228,(793),(752),0.033260,0.077995,0.017689,0.531828,6.818755,0.015095,1.969371
66,"(755, 793)",(776),0.023426,0.131335,0.020830,0.889213,6.770595,0.017754,7.840849
839,"(793, 787)",(776),0.016869,0.131335,0.014752,0.874494,6.658523,0.012537,6.921302
41,(793),(755),0.033260,0.111050,0.023426,0.704312,6.342274,0.019732,3.006378
