# Introduction/Business Problem
This notebook is my submission for the Coursera Data Science Capstone project.  I was unable to get my Foursquare account back up and running, for reasons unknown.  That is okay, because I'd much rather do the project on what I love -- Family History!

## Problem and Background
Researching **family history** is one of the largest internet activities people engage in worldwide.  As a consultant for this activity in my city, I meet with many people who are trying to get started and must first sift through work that has already been done.  Many have families that have been collecting information on the vital events (birth, marriage, death/burial) of their ancestors for decades, and they are drowning in piles of details.  That information represents completed research and serves as a launch point for finding new ancestors and relatives -- if they can make sense of it.  Even when family trees are kept in online ancestral tree programs, they are difficult to navigate in a meaningful way.  My personal tree contains 3990 people and 7720 vital events, which is somewhat intimidating.

When selecting a project, we simply want to know, "Where is an opportunity for me to make a difference?" or "Which location would be good for me to become an expert in, so that I can help others?"  Most people would like a quick, helpful overview for deciding **where it would be most productive to focus time and effort**.  While there are many tools out there, I haven't found one that addresses this question.

Answering this question by going through the information family by family would take far too long.  Part of the answer involves finding dense pockets of **family events** at locations (by geocode), and another part involves identifying the **quality/completion** of the information that exists.  Quality is not easily measured, because deciding whether the attached sources are good takes a judgment call, so we will only count whether sources exist.  Completion, in contrast, can be identified as 1) having a fact present, and 2) having at least 2 sources for each fact.
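The two completion criteria above can be sketched as a small scoring function. This is only an illustration: the function name `completion_score` and the 0.0 / 0.5 / 1.0 scale are my own assumptions, since the text defines the criteria but not a numeric scale.

```python
def completion_score(fact_present: bool, source_count: int) -> float:
    """Score one event against the two completion criteria.

    Hypothetical scale (an assumption, not defined in the text):
      0.0 -> the fact is missing
      0.5 -> the fact is present but has fewer than 2 sources
      1.0 -> the fact is present and has at least 2 sources
    """
    if not fact_present:
        return 0.0
    # Only the number of sources is counted, not their quality.
    return 1.0 if source_count >= 2 else 0.5


# Illustrative calls with made-up inputs.
missing = completion_score(False, 0)
partly_done = completion_score(True, 1)
complete = completion_score(True, 3)
```

Only the source count matters here, not the quality of each source, in keeping with the decision to count sources rather than judge them.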

Most data can be exported from family history tools as a **standard GEDCOM file**.  GEDCOM is an old standard, but it contains enough information for our purposes.  My goal is to take a given GEDCOM file containing a family tree and output information that clusters high-density family history event locations, reports the average completion status of each cluster, and ranks the clusters according to their value as a productive project area.
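To make the goal concrete, here is a rough sketch of the group-then-rank idea. It rests on several assumptions of mine: grid binning by rounded coordinates stands in for a real clustering algorithm, the function name `rank_areas` is hypothetical, and the priority formula (event count times the average share of missing completion) is only one possible way to rank areas. The sample coordinates and scores are made up.

```python
from collections import defaultdict


def rank_areas(events, decimals=1):
    """Group (lat, lon, completion_score) tuples into grid cells and rank them.

    Assumption: rounding coordinates to `decimals` places is a crude
    stand-in for proper density-based clustering.
    """
    cells = defaultdict(list)
    for lat, lon, score in events:
        cells[(round(lat, decimals), round(lon, decimals))].append(score)

    summary = []
    for cell, scores in cells.items():
        mean_completion = sum(scores) / len(scores)
        summary.append({
            "cell": cell,
            "events": len(scores),
            "mean_completion": mean_completion,
            # Hypothetical priority: many events with low completion rank first.
            "priority": len(scores) * (1.0 - mean_completion),
        })
    return sorted(summary, key=lambda row: row["priority"], reverse=True)


# Made-up events: three near one location, one elsewhere.
sample_events = [
    (42.36, -71.06, 0.5),
    (42.38, -71.08, 0.0),
    (42.41, -71.09, 0.5),
    (40.71, -74.01, 1.0),
]
ranked = rank_areas(sample_events)
```

With this formula, a dense area whose events are mostly incomplete rises to the top, while a fully documented area drops to the bottom regardless of size.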

## Data Description
We'll use **GEDCOM** files as input.  The format is old, but it is consistent and line-based, with fields separated by spaces.  A file has several sections:

1. One section contains **individuals** (with unique IDs) and their **birth and death** information with their date and place details and source IDs, and links them to their family group IDs (both as a child, and as a spouse). 
2. Another section contains **family groups**, which contain **marriage** information and its source IDs.  
3. Other sections contain unnecessary details for our purposes.  

Each line begins with a level number in the first field, followed by identifying information (a tag or a record ID) in the second field, and finally details (if any) in the third.  A '0' in the first field starts a new person or family record.  Lines with higher numbers belong to that person or family, with events marked by a '1' in the first field.
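The line layout described above can be illustrated with a small parser. This is a minimal sketch, not the reader used in this project: the names `GedcomLine` and `parse_gedcom_line` are hypothetical, and the sample lines are made up.

```python
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int            # 0 starts a record, 1 marks an event, and so on
    xref: Optional[str]   # record ID such as "@I1@", present on record lines
    tag: str              # e.g. INDI, FAM, BIRT, DATE, PLAC, SOUR
    value: str            # details, or "" when there are none


def parse_gedcom_line(line: str) -> GedcomLine:
    """Split one GEDCOM line into level, optional record ID, tag, and value."""
    tokens = line.strip().split(" ")
    level = int(tokens[0])
    rest = tokens[1:]
    xref = None
    # On record-opening lines the ID comes before the tag: "0 @I1@ INDI".
    if rest and rest[0].startswith("@") and rest[0].endswith("@"):
        xref = rest[0]
        rest = rest[1:]
    tag = rest[0] if rest else ""
    value = " ".join(rest[1:])
    return GedcomLine(level, xref, tag, value)


# Made-up sample lines.
record = parse_gedcom_line("0 @I1@ INDI")
event = parse_gedcom_line("1 BIRT")
place = parse_gedcom_line("2 PLAC Boston, Suffolk, Massachusetts, USA")
```

Keeping the record ID separate from the tag makes it easy to tell record-opening lines apart from event and detail lines later on.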

In the GEDCOM, we will concern ourselves with **only vital events (birth, marriage, and death)** and the **counts of sources** for each event.  Each event may have a **date and place** associated with it, but either could be absent.  If a place is present, we use the event; otherwise, we skip it.  
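The event rule above (keep only vital events that have a place, and count their sources) might look roughly like this. It is a simplified sketch under my own assumptions: the function name `extract_vital_events` is hypothetical, date, place, and source lines are assumed to sit one level below the event line, and the sample record is made up.

```python
VITAL_TAGS = {"BIRT", "MARR", "DEAT"}


def extract_vital_events(lines):
    """Collect birth, marriage, and death events that have a place."""
    events = []
    record_id = None
    current = None  # the event currently being filled in, if any
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        level, second = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            # A new person or family record begins.
            record_id = second if second.startswith("@") else None
            current = None
        elif level == 1:
            if second in VITAL_TAGS:
                current = {"record": record_id, "type": second,
                           "date": None, "place": None, "sources": 0}
                events.append(current)
            else:
                current = None
        elif level == 2 and current is not None:
            if second == "DATE":
                current["date"] = value
            elif second == "PLAC":
                current["place"] = value
            elif second == "SOUR":
                current["sources"] += 1
    # Events without a place are skipped.
    return [e for e in events if e["place"]]


# Made-up sample: the death event has no place, so it is dropped.
sample = """0 @I1@ INDI
1 NAME Jane /Doe/
1 BIRT
2 DATE 1 JAN 1850
2 PLAC Boston, Suffolk, Massachusetts, USA
2 SOUR @S1@
2 SOUR @S2@
1 DEAT
2 DATE 1900
0 @F1@ FAM
1 MARR
2 PLAC Salem, Essex, Massachusetts, USA
""".splitlines()

events = extract_vital_events(sample)
```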

For completion scores, we will concern ourselves with only single events.  Completion scores for individuals (including all their events) are beyond the scope of this project.

The data in the GEDCOM is often incomplete, poorly documented, or recorded in non-standard ways.  For locations, the most helpful data would be a geocode of where each event occurred.  GEDCOM files do not include geocodes, so **significant wrangling effort** is required before **finding geocodes** can be automated.  Locating at least 70% of the event places is considered satisfactory.  
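The 70% target can be checked with a simple coverage calculation once geocoding has been attempted. This sketch assumes each event is a dictionary with `Latitude` and `Longitude` keys that are `None` when no geocode was found; the function name `geocode_coverage` and the sample rows are made up for illustration.

```python
def geocode_coverage(events):
    """Return the fraction of events that have both a latitude and a longitude."""
    if not events:
        return 0.0
    located = sum(
        1 for e in events
        if e.get("Latitude") is not None and e.get("Longitude") is not None
    )
    return located / len(events)


# Made-up events: three of the four places were geocoded.
sample_events = [
    {"Place": "Boston, Suffolk, Massachusetts, USA", "Latitude": 42.36, "Longitude": -71.06},
    {"Place": "Salem, Essex, Massachusetts, USA", "Latitude": 42.52, "Longitude": -70.90},
    {"Place": "London, England", "Latitude": 51.51, "Longitude": -0.13},
    {"Place": "near the old mill", "Latitude": None, "Longitude": None},
]

coverage = geocode_coverage(sample_events)
meets_target = coverage >= 0.70
```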

Because of the significant wrangling effort and the long time it takes to get geocodes for thousands of events, I will wrangle the data in advance and provide a CSV file of events for this project to read into a pandas DataFrame with the following columns:  **Name, Gender, Event_Type, Date, Place, Latitude, Longitude, and Completion_Score**.
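As a sketch of what loading such a file could look like, the example below reads a tiny in-memory CSV with the eight columns listed above. The rows, the `Event_Type` labels, and the score values are invented for illustration; the real file's contents and value conventions may differ.

```python
import io

import pandas as pd

COLUMNS = ['Name', 'Gender', 'Event_Type', 'Date', 'Place',
           'Latitude', 'Longitude', 'Completion_Score']

# Made-up rows standing in for the pre-wrangled events file.
sample_csv = io.StringIO(
    "Name,Gender,Event_Type,Date,Place,Latitude,Longitude,Completion_Score\n"
    'Jane Doe,F,Birth,1 JAN 1850,"Boston, Suffolk, Massachusetts, USA",42.36,-71.06,1.0\n'
    'John Doe,M,Death,1900,"Salem, Essex, Massachusetts, USA",42.52,-70.90,0.5\n'
)

# usecols guards against extra columns; quoted places keep their commas.
events_df = pd.read_csv(sample_csv, usecols=COLUMNS)

# Confirm the expected columns arrived before any analysis.
missing_columns = [c for c in COLUMNS if c not in events_df.columns]
```

Place names usually contain commas, so they need to be quoted in the CSV, as in the sample rows, to stay in a single column.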

Lab link at https://github.com/SonjaThompson/Coursera_Capstone/blob/main/Coursera_Data_Science_Capstone.ipynb

In [None]:
import pandas as pd
import numpy as np
import requests # library to handle requests
import os

import FHclasses as fh
from gedcomRead import readPersons
import folium
from folium import plugins
import pickle

In [None]:
columns = ['Name', 'Gender', 'Event_Type', 'Date', 'Place', 'Latitude', 'Longitude', 'Completion_Score']