# Museums in the Pandemic - Extract indicators

**Authors**: Andrea Ballatore (KCL)

**Abstract**: Extract indicators from museum text.

## Setup
This cell checks that your environment is set up correctly: it should print `env ok` (warnings can be ignored).

In [162]:
# Check the Conda environment and import the libraries used in this notebook
import os
print("Conda env:", os.environ['CONDA_DEFAULT_ENV'])
if os.environ['CONDA_DEFAULT_ENV'] != 'mip_v1':
    raise Exception("Set the environment 'mip_v1' on Anaconda. Current environment: " + os.environ['CONDA_DEFAULT_ENV'])

# data, NLP and plotting libraries
import pandas as pd
import pickle
import spacy
from termcolor import colored
import sys
import numpy as np
from numpy import arange
#import tensorflow as tf
from bs4 import BeautifulSoup
from bs4.element import Comment
#import torch
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# import from `mip` project
print(os.getcwd())
fpath = os.path.abspath('../')
if fpath not in sys.path:
    sys.path.insert(0, fpath)

out_folder = '../../'

from museums import *
from utils import _is_number
from analytics.text_models import derive_new_attributes_matches, get_all_matches_from_db, get_indicator_annotations

print('env ok')

Conda env: mip_v1
/Users/andreaballatore/Dropbox/DRBX_Docs/Work/Projects/github_projects/museums-in-the-pandemic/mip/notebooks
env ok
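
The check above indexes `os.environ` directly, so it raises a `KeyError` when the notebook is started outside a Conda environment. A minimal sketch of a more tolerant check, assuming the same expected environment name `mip_v1`; `check_conda_env` is a hypothetical helper and not part of the `mip` project:

```python
import os

def check_conda_env(expected, environ=None):
    """Return the active Conda env name, or raise if it is not `expected`."""
    # `environ` can be passed explicitly, which makes the check easy to try out
    environ = os.environ if environ is None else environ
    current = environ.get('CONDA_DEFAULT_ENV')  # None when Conda is not active
    if current != expected:
        raise RuntimeError(
            "Set the environment '{}' on Anaconda. Current environment: {}".format(expected, current))
    return current

# toy call with an explicit mapping instead of the real process environment
print(check_conda_env('mip_v1', {'CONDA_DEFAULT_ENV': 'mip_v1'}))
```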


## Connect to DB

The DCS VPN must be active for the connection to work.

In [28]:
# open connection to DB
from db.db import connect_to_postgresql_db

db_conn = connect_to_postgresql_db()
print("DB connected")

DB connected


## Extract matches for all museums

Using the best deep learning validation model, which is loaded from disk below, find indicators for all museums (from websites and social media).

### Load deep learning validation model

In [115]:
from keras.models import load_model
from sklearn.preprocessing import MinMaxScaler

def remove_duplicate_matches(df):
    # find duplicates
    n = len(df)
    df = df.drop_duplicates(subset=df.columns.difference(['page_id','sentence_id']))
    print('remove_duplicate_matches:',n,len(df))
    return df

def prep_match_data(df):
    for c in valid_model_columns:
        if c not in df.columns:
            print("Warning: column '{}' is missing, adding a zero column".format(c))
            df[c] = 0
    
    df = remove_duplicate_matches(df)
    df = df[valid_model_columns]
    assert len(df.columns) == 33, len(df.columns)
    num_df = df.select_dtypes(include=[np.number])
    scaler = MinMaxScaler()
    # fit and transform in one step
    cols = num_df.columns
    x_data = pd.DataFrame(scaler.fit_transform(num_df),columns=cols)
    return x_data

def convert_pred_to_bool(vals):
    pred_y = (vals > 0.5).astype("bool")
    # unpack results
    bool_vals = [item for sublist in pred_y for item in sublist]
    return bool_vals
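
To make the thresholding step concrete, here is a small sketch that applies the same logic as `convert_pred_to_bool` to a hand-made array. It assumes that the model returns one score per row in an array of shape `(n, 1)`; the scores in `fake_pred` are invented for illustration.

```python
import numpy as np

def convert_pred_to_bool(vals):
    # same logic as the helper above: threshold at 0.5, then flatten the rows
    pred_y = (vals > 0.5).astype("bool")
    return [item for sublist in pred_y for item in sublist]

# stand-in for the output of model.predict(...): shape (4, 1)
fake_pred = np.array([[0.91], [0.12], [0.50], [0.51]])
flags = convert_pred_to_bool(fake_pred)

# an equivalent vectorised form that returns plain Python booleans
flags_flat = (fake_pred > 0.5).ravel().tolist()
print(flags_flat)
```

Because the comparison is strict, a score of exactly 0.5 is treated as an invalid match.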

In [68]:
# MODEL COLUMNS
cols_fn = out_folder+"data/analysis/matching_validation/matching_validation_deep_learning_model_columns.csv"
valid_model_columns = pd.read_csv(cols_fn).iloc[:, 0].tolist()

valid_ann_df_fn = 'matches_valid_ann_df_v3.pik'
valid_ann_df = pd.read_pickle(out_folder+'data/annotations/'+valid_ann_df_fn)

valid_match_cnn_model = load_model(out_folder+"data/analysis/matching_validation/matching_validation_deep_learning_model.h5")
valid_match_cnn_model

x_data = prep_match_data(valid_ann_df)
assert len(x_data.columns) == 33, len(x_data.columns)
print(x_data)
pred_valid = convert_pred_to_bool(valid_match_cnn_model.predict(x_data))

valid_ann_df['predicted_valid'] = pred_valid

#valid_ann_df.to_excel(out_folder+"tmp/check_deeplearning.xlsx",index=False)
valid_ann_df.sample(10)

     sem_similarity   token_n   lemma_n  ann_overlap_lemma  ann_overlap_token  \
0          0.659748  0.285714  0.166667           1.000000            1.00000   
1          0.703680  0.428571  0.333333           0.733333            0.75000   
2          0.759582  0.000000  0.000000           0.146667            0.00000   
3          0.726219  0.571429  0.500000           0.644448            0.66667   
4          0.667261  0.000000  0.000000           0.200000            0.00000   
..              ...       ...       ...                ...                ...   
695        0.830129  0.428571  0.333333           0.573333            0.60000   
696        0.712849  0.142857  0.166667           1.000000            0.50000   
697        0.398828  0.000000  0.000000           0.200000            0.00000   
698        0.647905  0.000000  0.000000           0.200000            0.00000   
699        0.661403  0.142857  0.000000           0.288885            0.33333   

     example_len  txt_overl

Unnamed: 0,muse_id,page_id,sentence_id,example_id,indicator_code,session_id,ann_ex_tokens,page_tokens,sem_similarity,token_n,...,indicator_code_project_postpone,indicator_code_reopen_intent,indicator_code_reopen_plan,indicator_code_staff_hiring,indicator_code_staff_restruct,indicator_code_staff_working,overlap_bin,valid_match,valid_match_b,predicted_valid
649,mm.domus.WM037,792242,mus_page792242_sent00003,ann_ex_00228,reopen_intent,20210304,we look forward welcoming,world slowly returns new normal we look forwar...,0.8352,3,...,0,1,0,0,0,0,"(0.45, 1.01]",T,True,True
671,mm.musa.392,116010,mus_page116010_sent00033,ann_ex_00264,staff_hiring,20210304,interested volunteering,volunteer focus magazine corporate partnership...,0.6264,0,...,0,0,0,1,0,0,"(0.45, 1.01]",F,False,False
561,mm.aim.0207,698073,mus_page698073_sent00053,ann_ex_00235,reopen_intent,20210304,we look forward seeing soon,we look forward re opening lockdown lifts whic...,0.8771,3,...,0,1,0,0,0,0,"(0.45, 1.01]",T,True,True
117,mm.musa.356,675180,mus_page675180_sent00062,ann_ex_00267,staff_working,20210304,team continues to work winter months,ok cancel restriction continue cancel your ref...,0.5891,0,...,0,0,0,0,0,1,"(0.0, 0.45]",F,False,False
510,mm.domus.SE439,781347,mus_page781347_sent00081,ann_ex_00191,online_exhib,20210304,online gallery,museum displays oak panelled 16th century old ...,0.5029,1,...,0,0,0,0,0,0,"(0.45, 1.01]",F,False,False
264,mm.ace.1109,516028,mus_page516028_sent00230,ann_ex_00180,online_event,20210304,online events,it becomes an annual event august bank holiday,0.6381,0,...,0,0,0,0,0,0,"(0.45, 1.01]",F,False,False
579,mm.domus.SC246,366843,mus_page366843_sent00106,ann_ex_00257,reopen_plan,20210304,we currently plan to re open,opens new window,0.6268,0,...,0,0,1,0,0,0,"(0.0, 0.45]",F,False,False
516,mm.MDN.006,706877,mus_page706877_sent00143,ann_ex_00232,reopen_intent,20210304,welcome you back in,positive activities grant pag rural funding ba...,0.5686,1,...,0,1,0,0,0,0,"(0.0, 0.45]",F,False,False
592,mm.domus.SE522,113653,mus_page113653_sent00153,ann_ex_00041,closed_indef,20210304,closed further notice,bookshop house shop remain closed further notice,0.8278,3,...,0,0,0,0,0,0,"(0.45, 1.01]",T,True,True
654,mm.musa.258,199540,mus_page199540_sent00120,ann_ex_00225,reopen_intent,20210304,we hope pray can bring some amazing special ev...,rich vibrant bold warm to bring some joy,0.7868,2,...,0,1,0,0,0,0,"(0.0, 0.45]",F,False,False
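
Note that `prep_match_data` fits a new `MinMaxScaler` on whatever data it receives. Whether this matters depends on how the model was trained, which this notebook does not show. The sketch below only illustrates, on invented numbers, that reusing a scaler fitted on one data set and re-fitting it on another can give different scaled values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [10.0]])   # toy feature, "training" batch
new = np.array([[5.0], [20.0]])     # toy feature, later batch

scaler = MinMaxScaler().fit(train)          # learn min/max once
reused = scaler.transform(new)              # apply the training min/max
refit = MinMaxScaler().fit_transform(new)   # re-learn min/max on the new batch

print(reused.ravel(), refit.ravel())
```

Here the reused scaler maps the new values to 0.5 and 2.0, while the re-fitted scaler maps them to 0.0 and 1.0.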


### Load all matches from DB

- Dump all matches from DB, after running `an_text` on ALL museums for a given crawling session.

In [31]:
# DB columns:
""" 
example_id indicator_code lemma_n lemma_n_wdupl token_n token_n_wdupl criticalwords_n criticalwords_n_wdupl sentence_id  sent_len
example_len example_crit_len ann_overlap_lemma ann_overlap_token ann_overlap_criticwords txt_overlap_lemma
txt_overlap_token ann_ex_tokens ann_ex_tokens page_tokens session_id page_id muse_id keep_stopwords
"""

# load from DB - SLOW
sessions = ['20210304','20210404','20210914']
for session_id in sessions:
    get_all_matches_from_db(session_id, db_conn, out_folder)

get_all_matches_from_db 20210304
query results: (1136440, 17)
	saved ../../tmp/matches_dump_df_20210304.pik
get_all_matches_from_db 20210404
query results: (842167, 17)
	saved ../../tmp/matches_dump_df_20210404.pik


### Predict all matches

In [116]:
sessions = ['20210304','20210404']

def select_valid_matches(df, model):
    """ use Deep Learning model to validate matches """
    x_data = prep_match_data(df)
    
    print('select_valid_matches', x_data.shape)
    # check column order
    assert valid_model_columns == x_data.columns.tolist()
    print(x_data.shape)
    # apply model for predictions
    valid_int = model.predict(x_data)
    pred_valid = convert_pred_to_bool(valid_int)
    #print(type(pred_valid),len(pred_valid))
    df['valid_match'] = pred_valid
    print(df.valid_match.value_counts())
    return df

assert len(valid_model_columns) > 0
allsess_match_df = pd.DataFrame(columns=valid_model_columns)

for session_id in sessions:
    print('> session_id',session_id)
    matches_fn = out_folder+'tmp/matches_dump_df_{}.pik'.format(session_id)
    matchdf = pd.read_pickle(matches_fn)
    matchdf = remove_duplicate_matches(matchdf)
    print("\t", matches_fn, matchdf.shape)
    # apply model to get valid matches
    validmatch_df = select_valid_matches(matchdf, valid_match_cnn_model)
    # save sample to inspect results
    validmatch_df.sample(200).to_csv(out_folder+'tmp/valid_matches_sample_{}.tsv'.format(session_id),sep='\t')
    # save results
    allsess_match_df = pd.concat([allsess_match_df, validmatch_df])

print('all matches:',len(allsess_match_df))

> session_id 20210304
remove_duplicate_matches: 1136440 926928
	 ../../tmp/matches_dump_df_20210304.pik (926928, 41)
remove_duplicate_matches: 926928 926928
select_valid_matches (926928, 33)
(926928, 33)
False    842258
True      84670
Name: valid_match, dtype: int64
> session_id 20210404
remove_duplicate_matches: 842167 729944
	 ../../tmp/matches_dump_df_20210404.pik (729944, 40)
remove_duplicate_matches: 729944 729944
select_valid_matches (729944, 33)
(729944, 33)
False    659234
True      70710
Name: valid_match, dtype: int64
all matches: 1656872
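
The drop from 1136440 to 926928 rows comes from `remove_duplicate_matches`, which treats two rows as duplicates when they agree on every column except `page_id` and `sentence_id`. A minimal sketch of that rule on an invented three-row table:

```python
import pandas as pd

df = pd.DataFrame({
    'page_id':     [1, 2, 3],
    'sentence_id': ['s1', 's2', 's3'],
    'example_id':  ['ann_ex_1', 'ann_ex_1', 'ann_ex_2'],
    'page_tokens': ['closed further notice'] * 3,
})

# compare rows on all columns except the two identifiers
ignore = ['page_id', 'sentence_id']
dedup = df.drop_duplicates(subset=df.columns.difference(ignore))
print(len(df), len(dedup))
```

The first two rows carry the same match on different pages, so only the first one is kept.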


In [131]:
print(allsess_match_df.columns)
print("Matches from DB:")
round(allsess_match_df['valid_match'].value_counts()/len(allsess_match_df),2)

Index(['sem_similarity', 'token_n', 'lemma_n', 'ann_overlap_lemma',
       'ann_overlap_token', 'example_len', 'txt_overlap_lemma',
       'txt_overlap_token', 'ann_overlap_criticwords', 'lemmatoken_n',
       'ann_overlap_tokenlemma', 'txt_overlap_tokenlemma',
       'indicator_code_closed_indef', 'indicator_code_closed_perm',
       'indicator_code_finance_health', 'indicator_code_funding_did_not_get',
       'indicator_code_funding_fundraise', 'indicator_code_funding_gov_emer',
       'indicator_code_funding_other_emer', 'indicator_code_lang_difficulty',
       'indicator_code_made_covid_safe', 'indicator_code_online_engag',
       'indicator_code_online_event', 'indicator_code_online_exhib',
       'indicator_code_open_cafe', 'indicator_code_open_cur',
       'indicator_code_open_onlineshop', 'indicator_code_project_postpone',
       'indicator_code_reopen_intent', 'indicator_code_reopen_plan',
       'indicator_code_staff_hiring', 'indicator_code_staff_restruct',
       'indicator

False    0.91
True     0.09
Name: valid_match, dtype: float64
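
The same proportions can be obtained with `value_counts(normalize=True)`. A short sketch on an invented series with 9 valid matches out of 100:

```python
import pandas as pd

valid = pd.Series([True] * 9 + [False] * 91, name='valid_match')

# shares of each value, rounded to two decimals
shares = valid.value_counts(normalize=True).round(2)
share_by_value = dict(zip(shares.index.tolist(), shares.tolist()))
print(share_by_value)
```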

## Aggregate indicators

In [180]:
# keep only the matches that the model considers valid
allsess_match_valid_df = allsess_match_df[allsess_match_df.valid_match]
assert len(allsess_match_valid_df) > 0
allsess_match_valid_df = remove_duplicate_matches(allsess_match_valid_df)
print("N =", len(allsess_match_valid_df))
print("N museums =", len(allsess_match_valid_df.muse_id.unique()))
allsess_match_valid_df.head(100)

remove_duplicate_matches: 155380 155380
N = 155380
N museums = 2912


Unnamed: 0,sem_similarity,token_n,lemma_n,ann_overlap_lemma,ann_overlap_token,example_len,txt_overlap_lemma,txt_overlap_token,ann_overlap_criticwords,lemmatoken_n,...,indicator_code_staff_working,muse_id,page_id,sentence_id,example_id,indicator_code,session_id,ann_ex_tokens,page_tokens,valid_match
12,0.6864,2.0,2.0,0.25000,0.25000,8,0.40000,0.40000,1.00000,2.0,...,0,mm.domus.SE118,16207.0,mus_page16207_sent00004,ann_ex_00222,reopen_intent,20210304,we will constantly monitor situation look to r...,museum plans to reopen july,True
13,0.7522,1.0,1.0,0.33333,0.33333,3,0.20000,0.20000,1.00000,1.0,...,0,mm.domus.SE118,16207.0,mus_page16207_sent00004,ann_ex_00223,reopen_intent,20210304,we can reopen,museum plans to reopen july,True
14,0.7469,1.0,1.0,0.25000,0.25000,4,0.20000,0.20000,1.00000,1.0,...,0,mm.domus.SE118,16207.0,mus_page16207_sent00004,ann_ex_00234,reopen_intent,20210304,will reopen fully in,museum plans to reopen july,True
15,0.8012,1.0,1.0,0.50000,0.50000,2,0.20000,0.20000,1.00000,1.0,...,0,mm.domus.SE118,16207.0,mus_page16207_sent00004,ann_ex_00244,reopen_intent,20210304,will reopen,museum plans to reopen july,True
17,0.7956,2.0,3.0,0.60000,0.40000,5,0.60000,0.40000,1.00000,3.0,...,0,mm.domus.SE118,16207.0,mus_page16207_sent00004,ann_ex_00254,reopen_plan,20210304,we plan to reopen season,museum plans to reopen july,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029,0.6732,1.0,1.0,0.25000,0.25000,4,0.05556,0.05556,1.00000,1.0,...,0,mm.domus.EM073,11350.0,mus_page11350_sent00024,ann_ex_00034,closed_cur,20210304,museum currently closed restrictions,events volunteer house interpretation voluntee...,True
1032,0.6408,2.0,2.0,0.66667,0.66667,3,0.11111,0.11111,0.66667,2.0,...,0,mm.domus.EM073,11350.0,mus_page11350_sent00024,ann_ex_00041,closed_indef,20210304,closed further notice,events volunteer house interpretation voluntee...,True
1033,0.7288,2.0,2.0,0.50000,0.50000,4,0.11111,0.11111,0.50000,2.0,...,0,mm.domus.EM073,11350.0,mus_page11350_sent00024,ann_ex_00042,closed_indef,20210304,museum closed further notice,events volunteer house interpretation voluntee...,True
1035,0.7728,2.0,2.0,0.50000,0.50000,4,0.11111,0.11111,0.50000,2.0,...,0,mm.domus.EM073,11350.0,mus_page11350_sent00024,ann_ex_00197,open_cur,20210304,now open careful visitors,events volunteer house interpretation voluntee...,True


In [177]:
# load annotations
indic_df, ann_df = get_indicator_annotations(out_folder)
del indic_df
ann_stats_df = ann_df.groupby(['indicator_code']).size().reset_index(name='n_indic_all_ann_examples')
print(ann_df.head(30))
ann_stats_df

      example_id                                       text_phrases  \
1   ann_ex_00002  closed to members of the public until further ...   
2   ann_ex_00003                       closed until further notice    
5   ann_ex_00006  currently we are closed due to Covid restricti...   
6   ann_ex_00007                                 had to close doors   
10  ann_ex_00011  there will be no services over due to the Covi...   
15  ann_ex_00016  currently closed due to Government Covid restr...   
16  ann_ex_00017                     our office wil not reopen till   
17  ann_ex_00018                         is now closed due to covid   
19  ann_ex_00020  We have made the decision to remain closed to ...   
21  ann_ex_00022  we have taken the hard decision to remain clos...   
22  ann_ex_00023   we have taken the hard decision to remain closed   
23  ann_ex_00024        will be closed during the national lockdown   
24  ann_ex_00025  currently closed in line with Government restr...   
25  an

Unnamed: 0,indicator_code,n_indic_all_ann_examples
0,closed_cur,21
1,closed_indef,4
2,closed_perm,4
3,finance_health,6
4,funding_did_not_get,7
5,funding_fundraise,45
6,funding_gov_emer,11
7,funding_other_emer,4
8,lang_difficulty,37
9,made_covid_safe,1


In [187]:
col_aggr = ['muse_id','session_id','page_id','indicator_code','n_indic_all_ann_examples']

def aggr_indicators_sent(df):
    d = {}
    for c in col_aggr:
        d[c] = df[c].tolist()[0]
    d['n_uniq_sentences'] = df['sentence_id'].nunique()
    d['n_matched_annotations'] = df['example_id'].nunique()
    d['n_matches'] = len(df)
    d['matches_to_sent_ratio'] = round(d['n_matches'] / d['n_uniq_sentences'],3)
    d['matches_to_example_ratio'] = round(d['n_matches'] / d['n_indic_all_ann_examples'],3)
    #d['matches_ratio'] = round(d['n_matches'] / d['n_indic_all_ann_examples'],3)
    return pd.Series(d)

n = len(allsess_match_valid_df)
allsess_match_valid_df2 = allsess_match_valid_df.merge(ann_stats_df, on='indicator_code')
assert n == len(allsess_match_valid_df2)

muse_indic_sent_df = allsess_match_valid_df2.groupby(col_aggr).apply(aggr_indicators_sent)
print(muse_indic_sent_df.columns)
muse_indic_sent_df.reset_index(drop=True, inplace=True)
muse_indic_sent_df = muse_indic_sent_df.sort_values(['session_id','muse_id','indicator_code'])
#print(muse_indic_sent_df.session_id.value_counts())
muse_indic_sent_df.to_excel(out_folder+'tmp/museum_indicators_sent_stats-v1.xlsx', index=False)
muse_indic_sent_df.head(30)

Index(['muse_id', 'session_id', 'page_id', 'indicator_code',
       'n_indic_all_ann_examples', 'n_uniq_sentences', 'n_matched_annotations',
       'n_matches', 'matches_to_sent_ratio', 'matches_to_example_ratio'],
      dtype='object')


Unnamed: 0,muse_id,session_id,page_id,indicator_code,n_indic_all_ann_examples,n_uniq_sentences,n_matched_annotations,n_matches,matches_to_sent_ratio,matches_to_example_ratio
0,domus.NE043,20210304,643977.0,closed_cur,21,5,1,5,1.0,0.238
1,domus.NE043,20210304,643977.0,closed_indef,4,2,1,2,1.0,0.5
2,domus.NE043,20210304,643977.0,online_engag,12,2,2,2,1.0,0.167
3,domus.NE043,20210304,643977.0,open_cur,3,1,1,1,1.0,0.333
4,domus.NE043,20210304,643977.0,project_postpone,3,1,1,1,1.0,0.333
5,domus.NE043,20210304,643977.0,reopen_intent,35,4,1,4,1.0,0.114
6,mm.MDN.006,20210304,706877.0,closed_cur,21,7,8,11,1.571,0.524
7,mm.MDN.006,20210304,706877.0,funding_fundraise,45,2,1,2,1.0,0.044
8,mm.MDN.006,20210304,706877.0,online_engag,12,26,7,61,2.346,5.083
9,mm.MDN.006,20210304,706877.0,online_event,8,1,2,2,2.0,0.25


## Get museum sample to inspect results

- Inspect the rows of the summary file that have been assigned to you (see the `evaluator` column). These rows represent indicators for museums in a given session, and only include matches that the system considered valid.
- For each row in the summary file, you can find the corresponding matches in the matches file. Please check them for quality, comparing the annotation tokens with the page tokens as usual.
- I realise that there are a lot of rows, so you can select a subsample of your choice. This is an exploratory task: the aim is mainly to check whether the results make sense and whether the accuracy seems to be about 80% (that is, at most 20% false positives). If needed, we can also do it as a formal evaluation.
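
As a sketch of how the 20% target could be checked once a subsample has been inspected, assume each checked match is recorded as `True` (correct) or `False` (false positive). The function name and the labels below are invented for illustration.

```python
def share_false_positives(labels):
    """Share of system-validated matches that the evaluator marked as wrong."""
    if not labels:
        raise ValueError("no labels to evaluate")
    return sum(1 for ok in labels if not ok) / len(labels)

# ten inspected matches, two of them judged wrong
checked = [True, True, False, True, True, True, True, False, True, True]
fp = share_false_positives(checked)
print('false positives: {:.0%}, within target: {}'.format(fp, fp <= 0.20))
```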

In [189]:
# note: this samples 10 rows, so the result may contain fewer than 10 distinct museums
museum_sample_ids = muse_indic_sent_df.muse_id.sample(10,random_state=10)
muse_sample_df = muse_indic_sent_df[muse_indic_sent_df.muse_id.isin(museum_sample_ids)]
muse_sample_df.to_excel(out_folder+'tmp/museum_website_match_sample10_summary.xlsx',index=False)
allsess_match_valid_df[allsess_match_valid_df.muse_id.isin(museum_sample_ids)].to_excel(out_folder+'tmp/museum_website_match_sample10_matches.xlsx',index=False)

muse_sample_df

Unnamed: 0,muse_id,session_id,page_id,indicator_code,n_indic_all_ann_examples,n_uniq_sentences,n_matched_annotations,n_matches,matches_to_sent_ratio,matches_to_example_ratio
4261,mm.aim.0788,20210304,375954.0,closed_cur,21,2,9,11,5.500,0.524
4262,mm.aim.0788,20210304,375954.0,funding_fundraise,45,1,1,1,1.000,0.022
4263,mm.aim.0788,20210304,375954.0,online_engag,12,3,6,7,2.333,0.583
4264,mm.aim.0788,20210304,375954.0,open_cur,3,1,1,1,1.000,0.333
4265,mm.aim.0788,20210304,375954.0,reopen_intent,35,2,17,17,8.500,0.486
...,...,...,...,...,...,...,...,...,...,...
24384,mm.musa.242,20210404,271710.0,online_engag,12,3,2,3,1.000,0.250
24385,mm.musa.242,20210404,271710.0,open_cur,3,3,2,3,1.000,1.000
24386,mm.musa.242,20210404,271710.0,reopen_intent,35,1,1,1,1.000,0.029
24387,mm.musa.242,20210404,271710.0,reopen_plan,4,1,2,2,2.000,0.500
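
The cell above draws its sample from the `muse_id` column of a table with several rows per museum, so museums with many rows are more likely to be picked and the same museum can be drawn more than once. If exactly 10 distinct museums are wanted, one option is to sample from the unique identifiers. A sketch on invented data, with a sample size of 3:

```python
import pandas as pd

rows = pd.DataFrame({'muse_id': ['m1'] * 5 + ['m2', 'm3', 'm4']})

# sampling rows: the same museum can appear several times
row_sample = rows.muse_id.sample(3, random_state=10)

# sampling unique identifiers: always 3 distinct museums
id_sample = rows.muse_id.drop_duplicates().sample(3, random_state=10)

print(row_sample.nunique(), id_sample.nunique())
```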


End of notebook