# Museums in the Pandemic - 00 tests

**Author**: Andrea Ballatore (Birkbeck, University of London)

**Abstract**: notebook for testing

## Setup
This cell checks that your environment is set up correctly. It should print 'env ok'; any warnings can be ignored.

In [1]:
# Check the environment and import the libraries used in this notebook
import os
conda_env = os.environ.get('CONDA_DEFAULT_ENV')  # None if conda is not active
print("Conda env:", conda_env)
if conda_env != 'mip_v1':
    raise Exception("Set the environment 'mip_v1' on Anaconda. Current environment: " + str(conda_env))

# data and NLP libraries
import pandas as pd
import pickle
from termcolor import colored
import sys
import spacy
import numpy as np
#import tensorflow as tf
from bs4 import BeautifulSoup
from bs4.element import Comment
#import torch
import matplotlib.pyplot as plt

# import from `mip` project
print(os.getcwd())
fpath = os.path.abspath('../')
if fpath not in sys.path:
    sys.path.insert(0, fpath)

out_folder = '../../'
    
print('env ok')

Conda env: mip_v1
C:\Users\VV\workspace1\museums-in-the-pandemic\mip\notebooks
env ok


# Museum text analytics


## Model1: Vectorise text from museum websites

### Connect to DB

In [2]:
# open connection to DB
from db.db import connect_to_postgresql_db

db_conn = connect_to_postgresql_db()
print("DB connected")

DB connected


### Set up spaCy NLP

In [3]:
# install language model
!python -m spacy download en_core_web_sm
# Note: if this cell does not work, run the same command 
#       without "!" in the Anaconda terminal

Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [4]:
# set up the spacy environment
import spacy
from spacy import displacy
from collections import Counter
spacy.prefer_gpu()
# load language model
import en_core_web_sm
nlp = en_core_web_sm.load()

In [7]:
# Spacy stopwords
all_stopwords = nlp.Defaults.stop_words

In [5]:
# get text from websites

from analytics.an_websites import get_attribute_for_webpage_url

session_id = '20210420'
test_urls = ['https://www.britishmuseum.org/']
attrib_name = 'all_text' # 'title'

for url in test_urls:
    print(url)
    res = get_attribute_for_webpage_url(url, session_id, attrib_name, db_conn)
    if not res:
        continue
    print(res)
    blocks = res.split("\n")
    print("LEN", len(blocks))
    print(blocks)

https://www.britishmuseum.org/
Skip to main content Please enable JavaScript in your web browser to get the best experience. We use cookies to make our website work more efficiently, to provide you with more personalised services or advertising to you, and to analyse traffic on our website. For more information on how we use cookies and how to manage cookies, please follow the 'Read more' link, otherwise select 'Accept and close'. Read more about our cookie policy Accept and close the cookie policy Menu Main navigation Visit Toggle Visit submenu Back
to previous menu —
Visit —
Visit —
Family visits —
Group visits —
Audio guide —
Out-of-hours tours —
Tours and talks —
Object trails —
Accessibility —
Food and drink —
Late opening on Fridays —
Museum map Exhibitions and events Collection Toggle Collection submenu Back
to previous menu —
Collection —
Collection —
Collection online —
Galleries —
Blog —
Audio tour highlights —
The British Museum podcast Learn Toggle Learn submenu Back
to pre

### Preprocess text

In [5]:
def spacy_extract_tokens(text):
    """
    Tokenise a text with spaCy, sentence by sentence.
    @returns data frame with one row per token: sentence id, token, lemma, POS tag, stop-word flag
    """
    rows = []
    doc = nlp(text)
    # segment sentences; sentence ids start at 1
    for sent_id, sentence in enumerate(doc.sents, start=1):
        for token in sentence:
            # collect one record per token
            rows.append({"sentence_id": sent_id, "token": token.text, "lemma": token.lemma_,
                         "pos_tag": token.pos_, "is_stop": token.is_stop})
    # build the data frame once: DataFrame.append in a loop is slow and was removed in pandas 2.0
    return pd.DataFrame(rows, columns=["sentence_id", "token", "lemma", "pos_tag", "is_stop"])

test_texts = ["""We need your support Your support is vital and helps the Museum to share the collection with the world. Make a donation What's online... The flowers of Mary Delany 233 years after her death, Delany's detailed floral collages still delight and inspire. Take a closer look at her work in the collection. How to explore the British Museum from home Whether it's a behind-the-scenes podcast or a closer look at our galleries, here are 10 ways to explore the Museum while we're closed. British histories beyond 'Bridgerton' Inspired by the hit Netflix show, watch a panel discussion exploring the reality behind the fantasy of 'Bridgerton'. Discover the Maya World Take a trip to Mexico and explore a wealth of content from the Maya Research Project, including stories, videos and 3D explorations."""]

for tt in test_texts:
    print(tt)
    print("")
    df = spacy_extract_tokens(tt)
    print("Tokens N =",len(df))
    fout = out_folder+'tmp/museum_text_tokens.csv'
    df.to_csv(fout, index=False)
    print("See tokens in",fout)


We need your support Your support is vital and helps the Museum to share the collection with the world. Make a donation What's online... The flowers of Mary Delany 233 years after her death, Delany's detailed floral collages still delight and inspire. Take a closer look at her work in the collection. How to explore the British Museum from home Whether it's a behind-the-scenes podcast or a closer look at our galleries, here are 10 ways to explore the Museum while we're closed. British histories beyond 'Bridgerton' Inspired by the hit Netflix show, watch a panel discussion exploring the reality behind the fantasy of 'Bridgerton'. Discover the Maya World Take a trip to Mexico and explore a wealth of content from the Maya Research Project, including stories, videos and 3D explorations.

Tokens N = 155
See tokens in ../../tmp/museum_text_tokens.csv
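
One possible next step towards vectorising the text is to drop stop words and punctuation from the token table and count the remaining lemmas. The sketch below is illustrative only: it assumes the column names produced by `spacy_extract_tokens` (`lemma`, `pos_tag`, `is_stop`), while the `lemma_counts` helper and the toy data frame are hypothetical and not part of the `mip` project. The notebook does not state which vectorisation the project actually uses.

```python
from collections import Counter

import pandas as pd


def lemma_counts(tokens_df):
    """Count lower-cased lemmas, ignoring stop words, punctuation and whitespace tokens."""
    keep = ~tokens_df["is_stop"] & ~tokens_df["pos_tag"].isin(["PUNCT", "SPACE"])
    return Counter(tokens_df.loc[keep, "lemma"].str.lower())


# Toy token table with the same columns as the output of spacy_extract_tokens (made-up rows)
toy = pd.DataFrame({
    "lemma":   ["the", "museum", "be", "closed", ".", "Museum"],
    "pos_tag": ["DET", "NOUN", "AUX", "ADJ", "PUNCT", "PROPN"],
    "is_stop": [True, False, True, False, False, False],
})
counts = lemma_counts(toy)
print(counts.most_common(2))
```

The resulting `Counter` can be turned into a bag-of-words vector once a fixed vocabulary has been chosen.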


### Annotations

In [10]:
# extract tokens from annotations
from analytics.text_models import get_indicator_annotations

indic_df, ann_df = get_indicator_annotations("../../")
ann_df

Unnamed: 0,text_phrases,indicator_code,indicator_attributes,notes,example_id
0,Closed now,closed_cur,,,0
1,closed to members of the public until further ...,closed_cur,,,1
2,closed until further notice,closed_cur,,,2
3,closed until Spring 2021,closed_cur,,,3
4,Closed: Until further notice,closed_cur,,,4
...,...,...,...,...,...
146,Bailiffgate is now closed due to Covid.,closed_cur,,,146
147,Visit our online shop,open_onlineshop,,,147
148,The new stunning Felton Group online exhibitio...,online_exhib,_description,,148
149,Stay up to date with what's happening at Baili...,online_engag,,,149


In [11]:
ann_tokens_df = pd.DataFrame()

for index, row in ann_df.iterrows():
    txt = str(row['text_phrases']).strip()
    df = spacy_extract_tokens(txt)
    #print(df)
    df['example_id'] = row['example_id']
    df['indicator_code'] = row['indicator_code']
    ann_tokens_df = pd.concat([ann_tokens_df, df])

# output annotations tokens
fout = out_folder+'tmp/test_annotations_tokens.csv'
ann_tokens_df.to_csv(fout, index=False)
print(fout)

> Closed now
> closed to members of the public until further notice.
> closed until further notice
> closed until Spring 2021
> Closed: Until further notice
> Currently Closed
> Currently we are closed due to Covid restrictions
> had to close its doors
> Murton Park is now closed for the Winter season
> Museum closed
> temporally closed
> The Gallery is now closed for winter.
> There will be no services over the Christmas period due to the Covid-19 pandemic
> to temporarily close
> we have closed [the railway]
> We’re closed
> now closed for the foreseeable future
> dismantling of some of our displays
> scheduled to be re-located in the near future
> SUSPEND ALL MUSEUM EVENTS for the foreseeable future
> will close from 28 February 2021 for the foreseeable future
> We will be however be open

In [6]:
from museums import get_museums_w_web_urls
from analytics.an_websites import get_attribute_for_webpage_url
df= get_museums_w_web_urls("../../")

datadict={}
session_id='20210614'
attrib_name = 'all_text'
for index, row in df.iterrows():
    datadict[row['muse_id']]=get_attribute_for_webpage_url(row['url'], session_id, attrib_name, db_conn)
tokeniseddict={}
for key, value in datadict.items():
    if value is not None:
        tokeniseddict[key]=spacy_extract_tokens(value)
data_items = tokeniseddict.items()
data_list = list(data_items)
print(tokeniseddict)
datadf=pd.DataFrame(data_list)
datadf.to_csv('../../tmp/a_x_test2.tsv', sep='\t')



museums urls: ../../data/museums/museum_websites_urls-v3.tsv
get_museums_w_web_urls Museums=3344 URLs=3344
select url, a.page_id, attrib_name, attrib_val from websites.web_pages_dump_20210614 p, websites.web_pages_dump_20210614_attr a where a.page_id = p.page_id 
        and p.url = 'https://www.100bgmus.org.uk/' and a.attrib_name = 'all_text';
select url, a.page_id, attrib_name, attrib_val from websites.web_pages_dump_20210614 p, websites.web_pages_dump_20210614_attr a where a.page_id = p.page_id 
        and p.url = 'https://www.english-heritage.org.uk/visit/places/1066-battle-of-hastings-abbey-and-battlefield/' and a.attrib_name = 'all_text';
select url, a.page_id, attrib_name, attrib_val from websites.web_pages_dump_20210614 p, websites.web_pages_dump_20210614_attr a where a.page_id = p.page_id 
        and p.url = 'https://www.doningtonleheath.org.uk/' and a.attrib_name = 'all_text';
select url, a.page_id, attrib_name, attrib_val from websites.web_pages_dump_20210614 p, websites.w

Pipeline outline, for each museum id and session id:

1. Get all URLs for the museum.
2. For each URL, get the `all_text` attribute.
3. Call the tokenisation on the `all_text` value.
4. Produce a data frame of tokens.
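
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `tokenise_museum_pages` and its three callable parameters (`get_urls`, `get_text`, `tokenise`) are hypothetical names. In the notebook, `get_text` would wrap `get_attribute_for_webpage_url` and `tokenise` would be `spacy_extract_tokens`; toy stand-ins are used here so the example runs without the database or spaCy.

```python
import pandas as pd


def tokenise_museum_pages(museum_ids, session_id, get_urls, get_text, tokenise):
    """Build one token data frame across all pages of all museums.

    All three callables are hypothetical placeholders:
      get_urls(muse_id, session_id) -> iterable of URLs
      get_text(url, session_id)     -> str, or None if the page has no text
      tokenise(text)                -> pd.DataFrame with one row per token
    """
    frames = []
    for muse_id in museum_ids:
        for url in get_urls(muse_id, session_id):
            text = get_text(url, session_id)
            if not text:
                continue  # page missing from this session, or empty
            df = tokenise(text).copy()
            # keep track of where each token came from
            df["muse_id"] = muse_id
            df["url"] = url
            df["session_id"] = session_id
            frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Toy stand-ins (made-up ids, URLs and texts)
_pages = {"m1": ["http://a.example/", "http://a.example/visit"],
          "m2": ["http://b.example/"]}
_texts = {"http://a.example/": "Closed now",
          "http://a.example/visit": None,
          "http://b.example/": "We are open"}


def _whitespace_tokens(text):
    # crude tokeniser used only for this demonstration
    return pd.DataFrame({"token": text.split()})


demo_df = tokenise_museum_pages(
    ["m1", "m2"], "20210614",
    get_urls=lambda muse_id, session_id: _pages[muse_id],
    get_text=lambda url, session_id: _texts[url],
    tokenise=_whitespace_tokens,
)
print(demo_df)
```

Passing the data access and tokenisation steps as callables keeps the loop independent of the database connection, which makes it easy to test on a handful of pages before running it on all museums.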


End of notebook