# Exploratory Data Analysis of the Greenwood Data

This exploratory data analysis (EDA) supports a spot check of selected columns, so that we can identify the values that need to be fixed.

We will follow the steps below:
- Load the data from all volumes into one DataFrame
- Select the columns that we are going to work on
- Identify the unique values in each column, with their counts
- Select the burial records that contain the cases we want to fix
- Fix the records
- Generate a new JSON file with the fixes to update our interments index

## Initializing the Notebook

In [1]:
import pandas as pd
import os
import duckdb

%load_ext sql

### Initialize the sql database

In [2]:
%sql duckdb:///sqlite/db.duckdb

%sql SET GLOBAL pandas_analyze_sample=600000

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

load_volumes = True

*  duckdb:///sqlite/db.duckdb
Done.


## Loading all volumes

We are going to load the JSON files for all volumes into a single in-memory structure, a pandas DataFrame. This gives us the analytical capabilities to inspect the values in each column and to fix them if needed.

In [3]:
if load_volumes:
  records = []
  for file in os.listdir('json'):
    records.append(pd.read_json(os.path.join('json',file)))

  df = pd.concat(records)
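
The loop above reads every entry in the `json` folder, and `pd.concat` keeps each file's own row index. A minimal sketch of a slightly more defensive variant is shown below. The helper name `read_volume_files` is hypothetical, and the sketch assumes that each volume is a `.json` file that `pd.read_json` can parse on its own.

```python
import os
import pandas as pd

def read_volume_files(folder):
    """Read every .json file in `folder` and stack the rows into one DataFrame."""
    frames = []
    # Sort the names so the row order is the same on every run.
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue  # skip anything that is not a JSON file
        frames.append(pd.read_json(os.path.join(folder, name)))
    # ignore_index=True gives the combined frame one continuous 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

In the notebook this could stand in for the loop as `df = read_volume_files('json')`. Note that `pd.concat` raises a `ValueError` when the folder contains no JSON files.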

Next, we import the DataFrame into DuckDB so that we can use SQL for our validation.

In [4]:
if load_volumes:
  %sql drop table if exists interments;

  %sql create table interments as select * from df;

## Identify Columns to Work With

Our first step is to identify the columns we want to work with. We will do this by looking at the columns used in the search filters of the website.
The columns we identified there are:
- date of interment
- birthplace
- marital status
- age at death
- late residence
- place of death
- cause of death
- date of death
- undertaker
- burial registry
- lot number

The next step is to list the columns in our DataFrame.

In [5]:
if load_volumes:
  columns = df.columns.to_list()

  for idx, column in enumerate(columns):
    print(column,end=', ')
    if (idx+1) % 7 == 0:
        print('')

interment_id, registry_image, interment_date_month_transcribed, interment_date_day_transcribed, interment_date_year_transcribed, interment_date_display, interment_date_iso, 
name_transcribed, name_display, name_last, name_first, name_middle, name_salutation, name_suffix, 
is_lot_owner, gender_guess, burial_location_lot_transcribed, burial_location_lot_current, burial_location_lot_previous, burial_location_grave_transcribed, burial_location_grave_current, 
burial_location_grave_previous, birth_place_transcribed, birth_place_displayed, birth_geo_formatted_address, birth_geo_is_faulty, birth_geo_street_number, birth_geo_street_name_long, 
birth_geo_street_name_short, birth_geo_neighborhood, birth_geo_city, birth_geo_county, birth_geo_state_short, birth_geo_state_long, birth_geo_country_long, 
birth_geo_country_short, birth_geo_zip, birth_geo_place_id, birth_geo_formatted_address_extra, birth_place_geo_location, age_years_transcribed, age_months_transcribed, 
age_days_transcribed, age_hour

Now, we are going to map each search filter to the column we want to check.
- date of interment = interment_date_year_transcribed
- birthplace = birth_place_displayed
- marital status = marital_status [x]
- age at death = age_years [x]
- late residence = residence_place_city_display
- place of death = death_place_display
- cause of death = cause_of_death_display [x]
- date of death = death_date_year_transcribed [x]
- undertaker = undertaker_display
- burial registry = interment_id, registry_volume
- lot number = burial_location_lot_current
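
Because the mapping above is written by hand, a quick check against the actual column list can catch a misspelled name before any query is written. The following is a minimal sketch. `find_missing_columns` and the small `example_mapping` are hypothetical and only illustrate the idea.

```python
def find_missing_columns(mapping, available):
    """Return {filter label: [mapped column names not present in `available`]}."""
    available = set(available)
    missing = {}
    for label, columns in mapping.items():
        absent = [c for c in columns if c not in available]
        if absent:
            missing[label] = absent
    return missing

# Illustrative subset of the mapping; extend it with the remaining filters.
example_mapping = {
    "date of interment": ["interment_date_year_transcribed"],
    "burial registry": ["interment_id", "registry_volume"],
    "lot number": ["burial_location_lot_current"],
}
```

With the DataFrame loaded, `find_missing_columns(example_mapping, df.columns)` returns an empty dict when every mapped column exists.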


Now we are going to use SQL statements to discover issues with the data in those fields.

### Scratchpad

In [6]:
%%sql

select count(*)
from interments
where cause_of_death_display = '' 
or cause_of_death_display = '-'
or cause_of_death_display = '- P' 

count_star()
0


In [7]:
%%sql

select count(*)
from interments
where cause_of_death_display like '1%' or 
cause_of_death_display like '2%' or
cause_of_death_display like '3%' or
cause_of_death_display like '4%' or
cause_of_death_display like '5%' or
cause_of_death_display like '6%' or
cause_of_death_display like '7%' or
cause_of_death_display like '8%' or
cause_of_death_display like '9%' or
cause_of_death_display like '0%'

count_star()
0
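
The ten `like` conditions above test whether the value starts with a digit. As a cross-check on the DataFrame side, the same test can be sketched in pandas with a single regular expression. `count_leading_digit` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def count_leading_digit(values):
    """Count the entries whose text starts with one of the digits 0-9."""
    s = pd.Series(values, dtype="object")
    # str.match anchors the pattern at the start of each string;
    # na=False treats missing values as non-matching.
    return int(s.str.match(r"[0-9]", na=False).sum())
```

Calling `count_leading_digit(df["cause_of_death_display"])` should agree with the count returned by the SQL query.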


In [8]:
%%sql

select cause_of_death_display as cause_of_death, count(*) as records
from interments
group by cause_of_death_display
order by 1
limit 10

cause_of_death,records
,14538
& Bese,1
& Prolonged Labor,1
() Old Age () Diarrhoea,1
(Accidental) Mercury Bichloride,1
(Charity Hospital) Brights,1
(Heart Failure ) Typhoid Fever,1
(Non Contagious),1
(Post Operation) Exophthalmia Goitres,1
A,1


In [9]:
%%sql 

select column_name
from information_schema.columns
where table_name = 'interments'
and column_name like 'a%'

column_name
age_years_transcribed
age_months_transcribed
age_days_transcribed
age_hours_transcribed
age_display
age_years
age_months
age_days
age_hours


In [10]:
%%sql

with raw_data as (
SELECT interment_id, registry_volume, registry_image, age_years, try_cast(age_years_transcribed as integer) as age_years_converted, age_years_transcribed, age_display
FROM interments)
select * from raw_data
where age_years_converted > 117

interment_id,registry_volume,registry_image,age_years,age_years_converted,age_years_transcribed,age_display
406593.0,56.0,Volume 56_093,0,779,779,"779 years, 5 days"
9864.0,1.0,Volume 01_221,0,339,339,"339 years, 8 months, 13 days"
375371.0,52.0,Volume 52_042,0,788,788,"788 years, 5 months, 13 days"
384834.0,53.0,Volume 53_097,0,664,664,"664 years, 2 months, 8 days"
349562.0,49.0,Volume 49_026,0,809,809,"809 years, 1 month"
394535.0,55.0,Volume 55_041,0,862,862,"862 years, 19 months"
311572.0,43.0,Volume 43_056,0,789,789,789 years
311619.0,43.0,Volume 43_057,0,289,289,"289 years, 6 months"
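
`try_cast` returns NULL instead of raising an error when a value cannot be converted. A roughly analogous check in pandas, assuming the same 117-year threshold, can be sketched with `pd.to_numeric(..., errors="coerce")`. The function name and defaults below are illustrative. The two approaches are not identical: `pd.to_numeric` also accepts decimal strings, so edge cases may differ from the SQL result.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 117  # same threshold as the SQL query above

def implausible_ages(frame, column="age_years_transcribed", limit=MAX_PLAUSIBLE_AGE):
    """Return the rows whose transcribed age converts to a number above `limit`."""
    # Values that cannot be parsed become NaN, and NaN > limit is False,
    # so unparseable entries are left out of the result.
    converted = pd.to_numeric(frame[column], errors="coerce")
    return frame[converted > limit]
```

For example, `implausible_ages(df)[["interment_id", "age_years_transcribed", "age_display"]]` lists the records to review.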


In [11]:
%%sql

select marital_status, count(*) as recs
from interments
group by marital_status

marital_status,recs
Widow,74211
Single,57426
Married,132309
Not recorded,168809
Divorced,408
Infant,19
Unknown,4
Separated,9
Marriage Annulled,1
Legally Separated,2
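
Several of the categories above appear only a handful of times, which makes them natural candidates for the spot check. A minimal pandas sketch for listing such low-frequency values follows. `rare_values` and its default threshold of 20 are assumptions chosen for illustration.

```python
import pandas as pd

def rare_values(series, max_count=20):
    """Return the values that occur at most `max_count` times, rarest first."""
    # dropna=False keeps missing values as their own category.
    counts = series.value_counts(dropna=False)
    return counts[counts <= max_count].sort_values()
```

With the notebook's DataFrame, `rare_values(df["marital_status"])` would list the categories whose count is 20 or less.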
