# Genizah Medical Data

Some of the metadata descriptions for the [Cairo Genizah](https://cudl.lib.cam.ac.uk/collections/genizah/) fragments are medical in nature, for example [T-S Ar.43.324](https://cudl.lib.cam.ac.uk/view/MS-TS-AR-00043-00324/1).

We'd like to analyse the descriptions of these fragments to see what we can learn about medicine.

This repository's `medical-data` directory contains `genizah-tei.tar.lz`, an archive of all the Genizah TEI metadata. (This file is generated by [bundle-genizah-tei.sh](../medical-data/bundle-genizah-tei.sh).)

In [1]:
import re
import sys
import tarfile
import warnings

from lxml import etree
import numpy as np
import pandas as pd

Import the `genizahdata` helper module, which provides functions for working with the TEI metadata.

In [2]:
import genizahdata as gd

In [3]:
# Suppress warnings about messy metadata
warnings.filterwarnings('ignore', category=gd.GenizahDataWarning)
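
As a rough illustration of what a category-based filter does, the sketch below uses a hypothetical `MessyMetadataWarning` class as a stand-in for `gd.GenizahDataWarning` (whose actual definition is not shown here). Only warnings of that category are silenced; other warnings still get through.

```python
import warnings


class MessyMetadataWarning(UserWarning):
    """Hypothetical stand-in for gd.GenizahDataWarning."""


def demo():
    # Record warnings locally so the global filter state is left untouched.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        # Ignore only the metadata category; this filter takes precedence
        # because filterwarnings() inserts at the front of the filter list.
        warnings.filterwarnings('ignore', category=MessyMetadataWarning)
        warnings.warn('unparseable dimensions', MessyMetadataWarning)
        warnings.warn('something else went wrong', UserWarning)
    return [str(w.message) for w in caught]


remaining = demo()
print(remaining)
```

Filtering by category rather than by message text keeps the filter narrow: genuinely unexpected warnings remain visible while the notebook runs.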

Load descriptions of medical fragments and store them in a pandas data frame.

In [4]:
bundle = tarfile.open('../medical-data/genizah-tei.tar.lz')

data = pd.DataFrame.from_records(
    (gd.get_data(path, root) for path, root in gd.medical_elements(gd.extract_tar_xml(bundle))),
    index='classmark')
data.head()

| classmark | columns | date_end | date_start | height | lines | material | summary | title | width |
|---|---|---|---|---|---|---|---|---|---|
| MS-OR-01080-00001-00063 | 1.0 | 1899-12-31 | 0500-01-01 | 21.2 | 21.0 | paper | Pharmacopoeia, containing diagrams and symbols... | Medical | 14.3 |
| MS-OR-01080-00001-00072 | 1.0 | 1899-12-31 | 0500-01-01 | 36.4 | 22.0 | vellum | Discussion of various medical treatments, regi... | Medical | 16.8 |
| MS-OR-01080-00001-00081 | 1.0 | 1899-12-31 | 0500-01-01 | 25.4 | 12.0 | paper | Medical work on the composition of the body, c... | Medical | 16.8 |
| MS-OR-01080-00001-00087 | 1.0 | 1233-12-31 | 1213-01-01 |  | 5.0 | paper | Recto: a short medical recipe. Verso: a respon... | Medical |  |
| MS-OR-01080-00002-00070 | 1.0 | 1199-12-31 | 1100-01-01 | 31.5 | 35.0 | paper | Autograph draft of a medical work by Moses Mai... | Medical | 22.8 |


## Cleanup

The material field contains some junk values:

In [5]:
data['material'].unique()

array(['paper', 'vellum', '9.1 x 9', 'paper 1 leaf', 'aper',
       'paper: 2 leaves (bifolium)', 'paper, 1 leaf', 'cloth',
       'cardboard'], dtype=object)

In [6]:
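# Anything mentioning "(p)aper" is paper, including the truncated 'aper'
data.loc[data['material'].str.contains('aper'), 'material'] = 'paper'
# Remaining values containing digits are dimensions, not materials
data.loc[data['material'].str.contains(r'\d'), 'material'] = None
data['material'].unique()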

array(['paper', 'vellum', None, 'cloth', 'cardboard'], dtype=object)
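
The same two rules can be written as a small standalone function, which makes them easier to check against the raw values listed above. This is only a sketch: `normalise_material` is a hypothetical helper, not part of `genizahdata`.

```python
import re


def normalise_material(value):
    """Return a clean material label, or None for junk values.

    Hypothetical helper mirroring the two cleanup rules in the cell above.
    """
    if not isinstance(value, str):
        return None
    # Rule 1: any mention of "(p)aper" counts as paper.
    if 'aper' in value:
        return 'paper'
    # Rule 2: leftover values containing digits are measurements, not materials.
    if re.search(r'\d', value):
        return None
    return value


raw_values = ['paper', 'vellum', '9.1 x 9', 'paper 1 leaf', 'aper',
              'paper: 2 leaves (bifolium)', 'paper, 1 leaf', 'cloth',
              'cardboard']
cleaned = [normalise_material(v) for v in raw_values]
print(cleaned)
```

If this form is preferred, it could be applied to the data frame with `data['material'].map(normalise_material)`. The order of the rules matters: checking for digits first would discard values such as `'paper 1 leaf'`.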

## Write out dataset

Create a JSON dataset from our Genizah medical metadata.

In [7]:
data.to_json('../medical-data/genizah-medical.json', orient='index')
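
With `orient='index'`, pandas writes a JSON object keyed by the index (here the classmark), where each value is an object mapping column names to values and missing values become `null`. The sketch below shows how a consumer might read that shape using only the standard library. The inline sample is an illustrative subset of the fields shown earlier, not the real file, and `with_known_height` is a hypothetical helper.

```python
import json

# Illustrative sample in the orient='index' shape: {classmark: {column: value}}.
sample = '''
{
  "MS-OR-01080-00001-00087": {"columns": 1.0, "height": null, "lines": 5.0,
                              "material": "paper", "title": "Medical"},
  "MS-OR-01080-00002-00070": {"columns": 1.0, "height": 31.5, "lines": 35.0,
                              "material": "paper", "title": "Medical"}
}
'''
records = json.loads(sample)


def with_known_height(records):
    """Return the classmarks whose height was recorded (not null)."""
    return sorted(classmark for classmark, fields in records.items()
                  if fields.get('height') is not None)


known = with_known_height(records)
print(known)
```

Keying the output by classmark lets a consumer look up a single fragment directly without scanning a list of records.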