# Exploratory Data Analysis of the Greenwood Data

This exploratory data analysis spot-checks selected columns to identify data that needs to be fixed.

We will follow the steps below:
- Load the data from all volumes into one DataFrame
- Select the columns that we are going to work on
- Identify the unique values in each column, with their counts
- Select the burial records that contain the cases we want to fix
- Fix the records
- Generate a new JSON file with the fixes to update our interments index

## Initializing the Notebook

In [1]:
import pandas as pd
import os
import duckdb

%load_ext sql

### Initialize the sql database

In [2]:
%sql duckdb:///sqlite/db.duckdb

%sql SET GLOBAL pandas_analyze_sample=600000

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

*  duckdb:///sqlite/db.duckdb
Done.


## Loading all volumes

We are going to load the JSON files for all volumes into a single in-memory structure, a pandas DataFrame, which gives us the analytical capabilities to inspect the values in each column and fix them where needed.

In [3]:
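# Read each volume's JSON file into its own DataFrame.
records = []
for file in os.listdir('json'):
    records.append(pd.read_json(os.path.join('json', file)))

# Combine the per-volume DataFrames into a single DataFrame.
df = pd.concat(records)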
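
As a hedged alternative to the loop above, the sketch below reads the same kind of directory using only the standard library. It assumes each file holds a JSON array of record objects (the notebook does not state the file layout), and the helper name `load_records` is hypothetical. Files are read in sorted order and anything that is not a `.json` file is skipped, so the result does not depend on the order returned by `os.listdir`.

```python
import json
from pathlib import Path


def load_records(directory):
    """Return all records from the *.json files in `directory` as one list.

    Assumption: every file contains a JSON array of objects.
    """
    records = []
    # sorted() makes the load order deterministic across runs and platforms.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records
```

Under that assumption, `pd.DataFrame(load_records('json'))` would build the DataFrame in one step.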

Next, we import the DataFrame into DuckDB so that we can use SQL for our validation.

In [5]:
%%sql 

drop table if exists interments;

create table interments as select * from df;

Count
433848


## Identify Columns to Work

Our first step is to identify the columns we want to work with. We do this by looking at the columns used in the search filters of the website.
The columns we identified there are:
- date of interment
- birthplace
- marital status
- age at death
- late residence
- place of death
- cause of death
- date of death
- undertaker
- burial registry
- lot number

The next step is to list the columns in our DataFrame.

In [6]:
columns = df.columns.to_list()

for idx, column in enumerate(columns):
    print(column,end=', ')
    if (idx+1) % 7 == 0:
        print('')

interment_id, registry_image, interment_date_month_transcribed, interment_date_day_transcribed, interment_date_year_transcribed, interment_date_display, interment_date_iso, 
name_transcribed, name_display, name_last, name_first, name_middle, name_salutation, name_suffix, 
is_lot_owner, gender_guess, burial_location_lot_transcribed, burial_location_lot_current, burial_location_lot_previous, burial_location_grave_transcribed, burial_location_grave_current, 
burial_location_grave_previous, birth_place_transcribed, birth_place_displayed, birth_geo_formatted_address, birth_geo_is_faulty, birth_geo_street_number, birth_geo_street_name_long, 
birth_geo_street_name_short, birth_geo_neighborhood, birth_geo_city, birth_geo_county, birth_geo_state_short, birth_geo_state_long, birth_geo_country_long, 
birth_geo_country_short, birth_geo_zip, birth_geo_place_id, birth_geo_formatted_address_extra, birth_place_geo_location, age_years_transcribed, age_months_transcribed, 
age_days_transcribed, age_hour

Now we map each search filter to the DataFrame column(s) behind it.
- date of interment = interment_date_year_transcribed
- birthplace = birth_place_displayed
- marital status = marital_status
- age at death = age_years
- late residence = residence_place_city_display
- place of death = death_place_display
- cause of death = cause_of_death_display
- date of death = death_date_year_transcribed
- undertaker = undertaker_display
- burial registry = interment_id, registry_volume
- lot number = burial_location_lot_current
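
The mapping above can also be kept as data and checked against the real column names, which catches misspelled columns early. This is a minimal sketch: the names `FILTER_COLUMNS` and `missing_columns` are hypothetical, and the column names are copied from the mapping and the column listing in this notebook.

```python
# Search filter -> backing column(s), copied from the mapping above.
FILTER_COLUMNS = {
    "date of interment": ["interment_date_year_transcribed"],
    "birthplace": ["birth_place_displayed"],
    "marital status": ["marital_status"],
    "age at death": ["age_years"],
    "late residence": ["residence_place_city_display"],
    "place of death": ["death_place_display"],
    "cause of death": ["cause_of_death_display"],
    "date of death": ["death_date_year_transcribed"],
    "undertaker": ["undertaker_display"],
    "burial registry": ["interment_id", "registry_volume"],
    "lot number": ["burial_location_lot_current"],
}


def missing_columns(mapping, available):
    """Return the mapped column names that are absent from `available`."""
    available = set(available)
    return sorted(
        column
        for columns in mapping.values()
        for column in columns
        if column not in available
    )
```

Calling `missing_columns(FILTER_COLUMNS, df.columns)` should return an empty list when every mapped name exists in the DataFrame.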


Now we use SQL statements to discover issues with the data in those fields.

### Scratchpad

In [8]:
%%sql

select cause_of_death_display as cause_of_death, count(*) as records
from interments
group by cause_of_death_display
order by 1 
limit 10

cause_of_death,records
,635
,13890
""" Compression Of The Brain",1
""" Phthisis",1
""" Pneumonia",1
-,1
Apoplexy,1
Asphyxia,1
Chronic Myocarditis,1
Consumption,1
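
For readers who prefer plain Python, the grouping query above can be mirrored with `collections.Counter`. This is a sketch only: `value_counts` is a hypothetical helper, the sample values are illustrative rather than taken from the data, and placing `None` first in the sort is a choice made here, not a claim about DuckDB's ordering.

```python
from collections import Counter


def value_counts(values):
    """Count each distinct value, similar to GROUP BY ... COUNT(*)."""
    counts = Counter(values)
    # Sort by value; None (SQL NULL) is placed first to keep the output stable.
    return sorted(
        counts.items(),
        key=lambda item: (item[0] is not None, item[0] or ""),
    )


# Illustrative input: None stands for NULL, "" for an empty string.
sample = ["Apoplexy", None, "", "Consumption", "Apoplexy", None]
result = value_counts(sample)
```

Keeping `None` and the empty string as separate keys makes it visible that the data may hold two different kinds of blank value.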


In [9]:
%%sql

update interments
set cause_of_death_display = ltrim(replace(cause_of_death_display, '"', ''))

Count
433848
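
The same cleanup can be expressed per value in Python, which makes it easy to try on a few strings before running the update. In this sketch `strip_quotes` is a hypothetical helper. It assumes that a row shown as `""" Phthisis"` in the output above is the CSV-escaped form of the stored value `" Phthisis`.

```python
def strip_quotes(value):
    """Remove every double quote, then trim leading spaces.

    Mirrors ltrim(replace(value, '"', '')); None (SQL NULL) passes through.
    """
    if value is None:
        return None
    return value.replace('"', "").lstrip(" ")
```

Values without quotes come back unchanged, so the function is safe to apply to every row, as the update above does.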


In [15]:
%%sql

update interments
set cause_of_death_display = NULL
where cause_of_death_display = '' 
or cause_of_death_display = '-'
or cause_of_death_display = '- P' 

Count
1
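
Keeping the placeholder strings in one set makes the rule easier to extend than a growing chain of `or` conditions. The sketch below is illustrative: `PLACEHOLDERS` and `null_if_placeholder` are hypothetical names, and the three strings are the ones used in the update above.

```python
# Strings treated as "no cause recorded", taken from the update above.
PLACEHOLDERS = {"", "-", "- P"}


def null_if_placeholder(value):
    """Return None when `value` is a known placeholder, else the value unchanged."""
    return None if value in PLACEHOLDERS else value
```

The membership test is exact, so a value such as `- Pneumonia` is left untouched.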


In [21]:
%%sql

select cause_of_death_display, ltrim(regexp_replace(cause_of_death_display,'[0-9]','','g'))
from interments
where cause_of_death_display like '1%'

cause_of_death_display,"ltrim(regexp_replace(cause_of_death_display, '[0-9]', '', 'g'))"
1827 Of The Brain,Of The Brain
1863 Dysentery,Dysentery
1871 Scarlatina,Scarlatina
1878 Fatty Heart,Fatty Heart
1878 Asthenia Meningitis,Asthenia Meningitis
1879 Convulsions,Convulsions
1879 Peritonitis,Peritonitis
1879 Brights,Brights
1849 Consumption,Consumption
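
The query above strips every digit from the matched rows. A narrower option, sketched here in Python, removes only a leading four-digit year, so values that contain digits elsewhere, such as `(1) Old Age (2) Diarrhoea`, are left alone. The pattern and the helper name `strip_leading_year` are assumptions made for illustration, not the rule the notebook applies.

```python
import re

# Assumption: the stray prefix is a four-digit year followed by whitespace.
LEADING_YEAR = re.compile(r"^[0-9]{4}\s+")


def strip_leading_year(value):
    """Drop a leading 'YYYY ' prefix; None (SQL NULL) passes through."""
    if value is None:
        return None
    return LEADING_YEAR.sub("", value)
```

Because the pattern is anchored at the start of the string, it cannot alter digits that appear later in a cause of death.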


In [17]:
%%sql

select cause_of_death_display as cause_of_death, count(*) as records
from interments
group by cause_of_death_display
order by 1

cause_of_death,records
,14530
& Bese,1
& Prolonged Labor,1
(1) Old Age (2) Diarrhoea,1
(Accidental) Mercury Bichloride,1
(Charity Hospital) Brights,1
(Heart Failure ) Typhoid Fever,1
(Non Contagious),1
(Post Operation) Exophthalmia Goitres,1
1827 Of The Brain,1
