# Exploratory Data Analysis of the Greenwood Data

This exploratory data analysis supports spot checks of selected columns, so that we can identify the values that need to be fixed.

We will follow the steps below:
- Load the data from all volumes into one DataFrame
- Select the columns that we are going to work on
- Identify the unique values in each column, with their counts
- Select the burial records that contain the cases we want to fix
- Fix the records
- Generate a new JSON file with the fixes to update our interments index
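
Steps three and four can be sketched in pandas as follows. This is a minimal illustration on a made-up DataFrame: the `marital_status` column name appears later in this notebook, but the sample values and the `sample`, `counts`, and `to_fix` names are assumptions for the example only.

```python
import pandas as pd

# Toy stand-in for the interments data; the values are invented for illustration.
sample = pd.DataFrame({
    "interment_id": [1, 2, 3, 4, 5],
    "marital_status": ["Married", "married", "Single", None, "Married"],
})

# Step 3: unique values in a column with their counts (missing values included).
counts = sample["marital_status"].value_counts(dropna=False)
print(counts)

# Step 4: select the records that contain a case we want to fix,
# here the lowercase spelling "married".
to_fix = sample[sample["marital_status"] == "married"]
print(to_fix)
```

Passing `dropna=False` keeps missing values in the tally, which matters when empty fields are themselves one of the cases to review.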

## Initializing the Notebook

In [13]:
import pandas as pd
import os
import duckdb

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### Initialize the sql database

In [14]:
%sql duckdb:///sqlite/db.duckdb

In [15]:
%sql SET GLOBAL pandas_analyze_sample=600000

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

load_volumes = True

## Loading all volumes

We are going to load the JSON files for all volumes into a single in-memory structure, a pandas DataFrame. This gives us the analytical capabilities to inspect the values in each column and to fix them if needed.

In [16]:
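if load_volumes:
  # Read each volume's JSON file into its own DataFrame
  records = []
  for file in os.listdir('json'):
    records.append(pd.read_json(os.path.join('json', file)))

  # Stack the per-volume DataFrames into a single DataFrame
  df = pd.concat(records)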
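
The loop above reads every entry in the `json` folder. A slightly more defensive variant is sketched below, assuming each volume is a `.json` file holding a list of records. The `load_volumes_from` helper and the temporary demo files are hypothetical and exist only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_volumes_from(folder):
    """Read every *.json file in `folder` into one DataFrame (hypothetical helper)."""
    # sorted() makes the load order reproducible; the glob skips non-JSON entries.
    frames = [pd.read_json(path) for path in sorted(Path(folder).glob("*.json"))]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny made-up volumes written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "volume_01.json").write_text(json.dumps([{"interment_id": 1}]))
    Path(tmp, "volume_02.json").write_text(
        json.dumps([{"interment_id": 2}, {"interment_id": 3}])
    )
    Path(tmp, "notes.txt").write_text("not a volume")
    demo = load_volumes_from(tmp)

print(demo)
```

Filtering on the file extension avoids read errors from stray entries such as notebook checkpoint folders, and a fresh index avoids duplicate row labels across volumes.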

Next, we import the DataFrame into DuckDB so that we can use SQL for our validation.

In [17]:
if load_volumes:
  %sql drop table if exists interments;

  %sql create table interments as select * from df;

## Identify Columns to Work

Our first step is to identify the columns we want to work with. We will do this by looking at the columns used in the search filters of the website.
The columns we identified there are:
- date of interment
- birthplace
- marital status
- age at death
- late residence
- place of death
- cause of death
- date of death
- undertaker
- burial registry
- lot number

The next step is to list the columns in our DataFrame.

In [18]:
if load_volumes:
  columns = df.columns.to_list()

  for idx, column in enumerate(columns):
    print(column,end=', ')
    if (idx+1) % 7 == 0:
        print('')

interment_id, registry_image, interment_date_month_transcribed, interment_date_day_transcribed, interment_date_year_transcribed, interment_date_display, interment_date_iso, 
name_transcribed, name_display, name_last, name_first, name_middle, name_salutation, name_suffix, 
is_lot_owner, gender_guess, burial_location_lot_transcribed, burial_location_lot_current, burial_location_lot_previous, burial_location_grave_transcribed, burial_location_grave_current, 
burial_location_grave_previous, birth_place_transcribed, birth_place_displayed, birth_geo_formatted_address, birth_geo_is_faulty, birth_geo_street_number, birth_geo_street_name_long, 
birth_geo_street_name_short, birth_geo_neighborhood, birth_geo_city, birth_geo_county, birth_geo_state_short, birth_geo_state_long, birth_geo_country_long, 
birth_geo_country_short, birth_geo_zip, birth_geo_place_id, birth_geo_formatted_address_extra, birth_place_geo_location, age_years_transcribed, age_months_transcribed, 
age_days_transcribed, age_hour

Now we map each search filter to the corresponding DataFrame column:
- date of interment = interment_date_year_transcribed
- birthplace = birth_place_displayed
- marital status = marital_status [x]
- age at death = age_years [x]
- late residence = residence_place_city_display
- place of death = death_place_display
- cause of death = cause_of_death_display [x]
- date of death = death_date_year_transcribed [x]
- undertaker = undertaker_display
- burial registry = interment_id, registry_volume
- lot number = burial_location_lot_current
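
One way to keep this mapping checkable is to hold it in a dictionary and compare it with the DataFrame's columns. The sketch below covers only a few of the filters above. The `FILTER_COLUMNS` and `missing_columns` names are hypothetical, and the `available` list is a shortened stand-in for `df.columns`.

```python
# A subset of the filter-to-column mapping above (names taken from the notebook).
FILTER_COLUMNS = {
    "date of interment": ["interment_date_year_transcribed"],
    "marital status": ["marital_status"],
    "burial registry": ["interment_id", "registry_volume"],
    "lot number": ["burial_location_lot_current"],
}


def missing_columns(mapping, available):
    """Return the mapped column names that are not in `available`."""
    known = set(available)
    return [col for cols in mapping.values() for col in cols if col not in known]


# Shortened stand-in for df.columns.to_list(); the real list is much longer.
available = [
    "interment_id",
    "interment_date_year_transcribed",
    "burial_location_lot_current",
]

# Anything printed here is mapped but absent from this shortened list.
print(missing_columns(FILTER_COLUMNS, available))
```

Running the same check against the real `df.columns` would flag any mapped name that does not match a column exactly.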


In [21]:
%sql --conn --close