### Target Challenge
#### Silicon Slayers


### 1. Target
Our target is the number of people attending church in each census tract. We start by filtering the places data frame down to churches of The Church of Jesus Christ of Latter-day Saints by location name, then remove the temples by their street addresses. Next we organize the filtered data by census tract, which is possible because the data gives the home census tract of each visitor to a church building. We then estimate Sunday attendance by multiplying the total visitors for each census tract by the ratio of Sunday visits to visits during the rest of the week. Finally, we verify the results by checking that we have 298 tracts (the number of census tracts in Idaho) and graphing the results.

### 2. Pseudocode
- Load data and packages
- Filter places for LDS churches with regex
- Filter out the temples
- Join the filtered places table to the patterns table
- Isolate the month from the date_range_start column and create a month column
- Select the needed columns and explode visitor_home_aggregation
- Create a scaled visitors column (logic: tract_visitors * (normalized_visits_by_state_scaling / raw_visit_counts))
- Compute the ratio of Sunday visits to the other weekdays and multiply the scaled visitors by it
- Group by tract and month and sum total scaled visitors
- Plot and explore!
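
The Sunday adjustment in the pseudocode can be sketched in plain Python. This is a minimal illustration rather than the notebook's own code: it assumes `popularity_by_day` is keyed by weekday name (for example `"Sunday"`), and it reads the "ratio of Sunday visits" as Sunday's share of all visits in the week. The function name `sunday_share` and the sample numbers are hypothetical.

```python
def sunday_share(popularity_by_day):
    """Return Sunday's share of all visits, or None when there are no visits."""
    total = sum(popularity_by_day.values())
    if total == 0:
        return None
    return popularity_by_day.get("Sunday", 0) / total


# Hypothetical visit counts by weekday for one church building
popularity = {
    "Monday": 10, "Tuesday": 10, "Wednesday": 20, "Thursday": 10,
    "Friday": 5, "Saturday": 5, "Sunday": 140,
}

share = sunday_share(popularity)            # 140 Sunday visits out of 200 total
scaled_visitors = 50.0                      # hypothetical scaled visitors from one tract
sunday_visitors = scaled_visitors * share   # Sunday-only estimate for that tract
print(share, sunday_visitors)
```

If the ratio is instead meant as Sunday visits divided by the visits on the other six days, only the denominator in `sunday_share` would change.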

In [0]:
# Load libraries
import pyspark.sql.functions as F
import requests
import shutil
import pandas as pd
import plotly.express as px

In [0]:
# Read data
patterns = spark.read.parquet('dbfs:/data/idaho/patterns')
places = spark.read.parquet('dbfs:/data/idaho/places.parquet')
temples = spark.read.parquet('dbfs:/FileStore/temple_details_2.parquet')

Filter to only The Church of Jesus Christ of Latter-day Saints

In [0]:
# 384 Rows (exact match), 406 Rows (levenshtein, LDS matches, and excluding temples)
# My original code, replaced by Hathaway's
my_churches = places.filter(
    (places.location_name == 'LDS Church') | 
    (places.location_name == 'Church of Jesus Christ LDS') |
    (F.levenshtein(F.lit("The Church of Jesus Christ of Latter day Saints"), F.col("location_name")) < 10)
)

In [0]:
idaho_temples = places.filter(
    ((places.street_address == '750 S 2nd E') & (places.city == 'Rexburg')) | 
    ((places.street_address == '1000 Memorial Dr') & (places.city == 'Idaho Falls')) | 
    ((places.street_address == '3100 Butte St') & (places.city == 'Pocatello')) | 
    ((places.street_address == '1405 Eastland Dr N') & (places.city == 'Twin Falls')) | 
    ((places.street_address == '1211 S Cole Rd') & (places.city == 'Boise')) |
    ((places.street_address == '7355 N Linder Rd') & (places.city == 'Meridian'))
)

In [0]:
hathaway_churches = places.filter(
    (F.col("top_category") == "Religious Organizations") &
    # Raw string so \b reaches the regex engine as a word boundary, not a backspace character
    (F.col("location_name").rlike(r"Latter|latter|Saints|saints|LDS|\b[Ww]ard\b")) &
    (F.col("location_name").rlike("^((?!Reorganized).)*$")) &
    (F.col("location_name").rlike("^((?!All Saints).)*$")) &
    (F.col("location_name").rlike("^((?![cC]ath).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]ody).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]eter).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]atholic).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]res).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]inist).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]ission).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]orship).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]ain).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]aptist).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]eth).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]vang).)*$")) &
    (F.col("location_name").rlike("^((?![Ll]utheran).)*$")) &
    (F.col("location_name").rlike("^((?![Oo]rthodox).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]piscopal).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]abernacle).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]arvest).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ssem).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]edia).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]artha).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]hristian).)*$")) &
    (F.col("location_name").rlike("^((?![Uu]nited).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]ellowship).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]esl).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]osmas).)*$")) &
    (F.col("location_name").rlike("^((?![Gg]reater).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]rison).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]ommuni).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]lement).)*$")) &
    (F.col("location_name").rlike("^((?![Vv]iridian).)*$")) &
    (F.col("location_name").rlike("^((?![Dd]iocese).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]istory).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]chool).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]hought).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]oliness).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]artyr).)*$")) &
    (F.col("location_name").rlike("^((?![Jj]ames).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]ellowship).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]ouse).)*$")) &
    (F.col("location_name").rlike("^((?![Gg]lory).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]nglican).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]oetic).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]anctuary).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]quipping).)*$")) &
    (F.col("location_name").rlike("^((?![Jj]ohn).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ndrew).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]manuel).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]edeemed).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]erfecting).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ngel).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]rchangel).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]ich(ae|ea)l).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]hought).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]ariosse).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]osmas).)*$")) &
    (F.col("location_name").rlike("^((?![Dd]eliverance).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]ociete).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]emple).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]eminary).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]mployment).)*$")) &
    (F.col("location_name").rlike("^((?![Ii]nstitute).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]amp).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]tudent).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]ducation).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]ocial).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]elfare).)*$")) &
    (F.col("location_name").rlike("^((?![Cc][Ee][Ss]).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]amily).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]ary).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]ussian).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]eautif).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]eaven).)*$")) &    
    (F.col("location_name").rlike("^((?!Inc).)*$")) &
    (F.col("location_name").rlike("^((?!God).)*$"))
  )
  
hathaway_churches.display()

placekey,poi_cbg,parent_placekey,location_name,brands,safegraph_brand_ids,store_id,top_category,sub_category,naics_code,open_hours,category_tags,latitude,longitude,street_address,city,region,postal_code,iso_country_code,opened_on,closed_on,tracking_closed_since,websites,phone_number,wkt_area_sq_meters
zzy-222@5wj-krc-dsq,160499400001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,45.927403,-116.11182,Valley Vw,Kamiah,ID,83536,US,,,2019-07,[],12089352842.0,1362.0
zzw-222@5wq-s83-g8v,160679702001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.557902,-113.792354,241 N Overland Ave,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086780434.0,954.0
222-222@5wf-zyt-rhq,160439703001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.967664,-111.680578,145 E 1st N,St. Anthony,ID,83445,US,,,2019-07,"[""lds.org""]",12086247540.0,1281.0
223-223@5w9-hwd-5mk,160010102252.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.695074,-116.345195,700 E State St,Eagle,ID,83616,US,,,2019-07,[],12089392988.0,522.0
zzy-222@5ws-j8j-6x5,160419701002.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.100707,-111.871853,213 N 2nd E,Preston,ID,83263,US,,,2019-07,"[""lds.org""]",12088521469.0,60.0
zzy-222@5wq-s89-9cq,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.537956,-113.794962,213 W Main St,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086542591.0,923.0
222-222@5ws-hhd-zpv,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.230853,-111.95055,7389 N 3000 W,Preston,ID,83263,US,,,2019-07,"[""lds.org""]",12088522768.0,218.0
zzw-222@5wq-s83-g8v,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.557902,-113.792354,241 N Overland Ave,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086780434.0,954.0
zzy-222@5wr-7dm-9xq,160830009001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.586804,-114.443502,2085 S Temple Dr,Twin Falls,ID,83301,US,,,2019-07,"[""mormon.org""]",12087333446.0,2467.0
zzy-222@5w9-jbw-p7q,160010001001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.619514,-116.20184,Shamrock & Mcmillanrd,Boise,ID,83702,US,,,2019-07,"[""lds.org""]",12083773220.0,1210.0
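
The long chain of negative lookaheads in the filter above can also be written as one include pattern and one exclude pattern. The sketch below uses plain Python's `re` on a handful of made-up location names so it runs without Spark; the term lists are only a subset of the notebook's, and `is_meetinghouse` is a hypothetical helper name.

```python
import re

# Subset of the notebook's terms, for illustration only
include_terms = ["Latter", "latter", "Saints", "saints", "LDS", r"\b[Ww]ard\b"]
exclude_terms = ["Reorganized", "All Saints", "[Cc]ath", "[Bb]aptist",
                 "[Tt]emple", "[Ss]eminary"]

include_pattern = re.compile("|".join(include_terms))
exclude_pattern = re.compile("|".join(exclude_terms))


def is_meetinghouse(location_name):
    """Keep names that match an include term and no exclude term."""
    has_include = include_pattern.search(location_name) is not None
    has_exclude = exclude_pattern.search(location_name) is not None
    return has_include and not has_exclude


# Hypothetical location names
names = [
    "The Church of Jesus Christ of Latter day Saints",
    "All Saints Episcopal Church",
    "Rexburg Idaho Temple LDS",
    "Eagle 3rd Ward",
    "Edwards Chapel",
]

kept = [name for name in names if is_meetinghouse(name)]
print(kept)
```

In PySpark the same two patterns could be applied as `F.col("location_name").rlike(include) & ~F.col("location_name").rlike(exclude)`, where `include` and `exclude` are the joined pattern strings. Writing `\b` inside a raw string matters: in an ordinary Python string literal it is a backspace character, so the word boundary never reaches the regex engine.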


In [0]:
# Exclude the temples
# Join on the column name: both frames derive from `places`, so a column-expression condition can be ambiguous
churches = hathaway_churches.join(idaho_temples, on='placekey', how='left_anti')

### 3. Diagram of tables and columns used to build the feature

In [0]:
# Diagram of Tables and Columns
print("Church Schema: ")
churches.select(
    'placekey',
    'location_name',
).printSchema()
print()
print("Patterns Schema: ")
patterns.select(
    'placekey',
    'date_range_start',
    'raw_visit_counts',
    'popularity_by_day',
    'visitor_home_aggregation',
    'normalized_visits_by_state_scaling',
).printSchema()

Church Schema: 
root
 |-- placekey: string (nullable = true)
 |-- location_name: string (nullable = true)


Patterns Schema: 
root
 |-- placekey: string (nullable = true)
 |-- date_range_start: string (nullable = true)
 |-- raw_visit_counts: double (nullable = true)
 |-- popularity_by_day: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- visitor_home_aggregation: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- normalized_visits_by_state_scaling: double (nullable = true)



### 4. Code Snippet of Data Wrangling

In [0]:
""" 
Assumptions: 
The state scaling ratio is accurate. 
"""
# Keep only patterns rows for The Church of Jesus Christ of Latter-day Saints locations
church_patterns = patterns.join(churches, on='placekey', how='leftsemi')

# visitor_home_aggregation
home_agg = church_patterns.select(
    "*",
    # Explode the map of home census tracts and visitor counts into one row per tract
    F.explode(F.col('visitor_home_aggregation')).alias('tract', 'tract_visitors'),
).withColumn(
    # State scaling ratio: estimated visits per raw observed visit
    'state_scaling',
    F.col('normalized_visits_by_state_scaling') / F.col('raw_visit_counts'),
).withColumn(
    # Multiply tract visitors by the state scaling ratio to get a more accurate total estimate
    'tract_visitors_scaled',
    F.col('tract_visitors') * F.col('state_scaling'),
)
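
To make the explode-and-scale step concrete, here is a small worked example in plain Python. It is a sketch under assumptions, not the Spark code: the row values are invented, and `explode_and_scale` is a hypothetical helper that mirrors the column names used in this cell.

```python
# Hypothetical patterns row for one church building and one month
row = {
    "placekey": "example-placekey",
    "raw_visit_counts": 200.0,
    "normalized_visits_by_state_scaling": 3000.0,
    "visitor_home_aggregation": {"16065950100": 40, "16065950200": 10},
}


def explode_and_scale(row):
    """Return one record per home tract with a scaled visitor estimate."""
    # Estimated visits per raw observed visit
    state_scaling = row["normalized_visits_by_state_scaling"] / row["raw_visit_counts"]
    return [
        {
            "tract": tract,
            "tract_visitors": visitors,
            "state_scaling": state_scaling,
            "tract_visitors_scaled": visitors * state_scaling,
        }
        for tract, visitors in row["visitor_home_aggregation"].items()
    ]


records = explode_and_scale(row)
for record in records:
    print(record["tract"], record["tract_visitors_scaled"])
```

Here the scaling ratio is 3000 / 200 = 15, so 40 observed visitors from the first tract become an estimated 600 and 10 from the second become 150.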

In [0]:
tracts = patterns.select(
    F.explode(
        F.col('visitor_home_aggregation')
        ).alias('tract', 'tract_visitors'),  
)\
      .select('tract')\
      .filter(F.col('tract').startswith('16'))\
      .distinct()

In [0]:
# Sums scaled visitors by tract and month, then takes each tract's median monthly total
final_df = tracts.join(
    home_agg, on='tract', how='left'
                       
).select(
    "*",
    F.month(F.col('date_range_start')).alias('month')

).groupBy(
    'tract', 
    'month',

).agg(
    F.sum('tract_visitors_scaled').alias('sum_tract_visitors'),

).groupBy(
    'tract'

).agg(
    F.median('sum_tract_visitors').alias('active_member'),

)

final_df.display()

tract,active_member
16001000100,192.6767738023044
16001000201,430.0642655367232
16001000202,773.8599107074779
16001000302,237.35968623374936
16001000303,199.51837259316508
16001000304,680.9967214225804
16001000400,257.7632296011499
16001000500,537.1648492892907
16001000600,368.62651331719127
16001000701,271.6480451325112
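
The two-stage aggregation above (sum per tract and month, then median across months per tract) can be traced by hand on a few records. The following plain-Python sketch uses invented numbers and is only meant to show the arithmetic behind `active_member`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (tract, month, tract_visitors_scaled) records
records = [
    ("16001000100", 1, 100.0), ("16001000100", 1, 50.0),
    ("16001000100", 2, 200.0),
    ("16001000100", 3, 120.0),
    ("16001000201", 1, 400.0),
    ("16001000201", 2, 300.0),
]

# Stage 1: total scaled visitors for each (tract, month) pair
monthly_totals = defaultdict(float)
for tract, month, scaled in records:
    monthly_totals[(tract, month)] += scaled

# Stage 2: median of the monthly totals for each tract
totals_by_tract = defaultdict(list)
for (tract, _month), total in monthly_totals.items():
    totals_by_tract[tract].append(total)

active_member = {tract: median(totals) for tract, totals in totals_by_tract.items()}
print(active_member)
```

Taking the median rather than the mean keeps one unusually busy or quiet month from shifting a tract's estimate.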


In [0]:
final_df.write.parquet('/tmp/data/silicon_slayers.parquet')

In [0]:
final_df.select('active_member').summary().display()

summary,active_member
count,289.0
mean,698.16329921248
stddev,911.1165685590776
min,61.3722232953448
25%,144.9918166506468
50%,383.33994684323415
75%,783.3313407990313
max,5828.472407519707


In [0]:
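# Rexburg-area tracts (Madison County, FIPS 16065)
rexburg_tracts = ["16065950100", "16065950200", "16065950400", "16065950301", "16065950500", "16065950302"]
# Coeur d'Alene-area tracts (Kootenai County, FIPS 16055)
courd_tracts = ["16055000402", "16055000401", "16055001200", "16055000900"]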

In [0]:
final_df.filter(final_df["tract"].isin(rexburg_tracts)).display()

tract,active_member
16065950500,2210.412078051791
16065950100,2409.568038240532
16065950302,5828.472407519707
16065950200,1472.9333590882752
16065950301,1869.577722382577
16065950400,1263.209750271639


In [0]:
final_df.filter(final_df["tract"].isin(courd_tracts)).display()

tract,active_member
16055001200,91.44127535715484
16055000402,124.13046831720118
16055000401,73.34316765649406
16055000900,178.01976467531202


### 5. Visualizations

In [0]:
from pyspark.sql.types import DoubleType
vp = spark.read.csv('dbfs:/FileStore/virtual_pigeons.csv')
ss_vp = final_df.join(
    vp.select(F.col('_c0').alias('tract'),
              F.col('_c1').cast(DoubleType()).alias('Virtual Pigeons')),
    on='tract', how='left'
).select(
    'Tract',
    F.col('active_member').alias('Silicon Slayers'),
    'Virtual Pigeons'
)
ss_vp.display()
fig = px.box(ss_vp.toPandas(), y=['Silicon Slayers', 'Virtual Pigeons'], range_y=[0, 2200],
             title='Silicon Slayers and Virtual Pigeons Comparison'    
             )
fig.update_layout(
    xaxis_title="",
    yaxis_title="Active Members Per Tract"
)
fig.show()

Tract,Silicon Slayers,Virtual Pigeons
16001000100,192.6767738023044,
16001000201,430.0642655367232,104.0
16001000202,773.8599107074779,518.0
16001000302,237.35968623374936,61.0
16001000303,199.51837259316508,148.0
16001000304,680.9967214225804,370.0
16001000400,257.7632296011499,141.0
16001000500,537.1648492892907,71.0
16001000600,368.62651331719127,130.0
16001000701,271.6480451325112,82.0


In [0]:
px.histogram(final_df.toPandas(), x='active_member').show()

In [0]:
population = spark.read.parquet('dbfs:/FileStore/population.parquet')

active_per = final_df.join(population, on='tract').select(
    "*", 
    (F.col('active_member') / F.col('population')).alias('active_percent')
)

px.histogram(active_per.toPandas(), x='active_percent').show()