### Target Challenge
#### Silicon Slayers


### 1. Target
Our target is the number of people attending church in each census tract. We start by filtering the places data frame down to just LDS churches by location name, and we remove the temples by their street addresses. Next, we organize the filtered data by census tract, which is possible because the data reports the home census tract of each visitor to a church building. We then estimate Sunday attendance by multiplying the scaled visitors from each census tract by the share of visits that fall on Sunday. Finally, we verify the results by comparing the number of tracts with the 298 census tracts in Idaho and by graphing the results.

### 2. Pseudocode
- Load data and packages
- Filter places for LDS churches with regex
- Filter out the temples
- Join the filtered places table to the patterns table
- Isolate the month from the date_range_start column and create a month column
- Select the needed columns and explode visitor_home_aggregation
- Create a scaled visitors column (logic: tract_visitors * (normalized_visits_by_state_scaling / raw_visit_counts))
- Compute the share of visits that fall on Sunday (Sunday visits / visits on all seven days) and multiply the scaled visitors by it
- Group by tract and month and sum the scaled Sunday visitors
- Plot and explore!
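
The scaling steps in the list above can be sketched in plain Python for a single patterns record. This is a minimal illustration, not the Spark code used in the notebook: the function name `sunday_visitors_by_tract` and every value in `record` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def sunday_visitors_by_tract(record):
    """Return {tract: estimated Sunday visitors} for one patterns record."""
    # State scaling ratio: normalized visits divided by raw visits
    state_scaling = (
        record["normalized_visits_by_state_scaling"] / record["raw_visit_counts"]
    )
    # Share of the record's visits that fall on Sunday (Sunday / all seven days)
    days = record["popularity_by_day"]
    sunday_share = days["Sunday"] / sum(days.values())
    # "Explode" the home-tract map and scale each tract's visitor count
    return {
        tract: visitors * state_scaling * sunday_share
        for tract, visitors in record["visitor_home_aggregation"].items()
    }


# Hypothetical record (not taken from the SafeGraph data)
record = {
    "raw_visit_counts": 100.0,
    "normalized_visits_by_state_scaling": 1600.0,
    "popularity_by_day": {
        "Monday": 5, "Tuesday": 5, "Wednesday": 5, "Thursday": 5,
        "Friday": 0, "Saturday": 5, "Sunday": 75,
    },
    "visitor_home_aggregation": {"16001010331": 4, "16001010313": 8},
}

estimates = sunday_visitors_by_tract(record)
print(estimates)
```

With these made-up inputs the scaling ratio is 16 and the Sunday share is 0.75, so a tract with 4 sampled visitors is estimated at 48 Sunday visitors.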

In [0]:
# Load libraries
import pyspark.sql.functions as F
import requests
import shutil
import pandas as pd
import plotly.express as px

In [0]:
# Read data
patterns = spark.read.parquet('dbfs:/data/idaho/patterns')
places = spark.read.parquet('dbfs:/data/idaho/places.parquet')
temples = spark.read.parquet('dbfs:/FileStore/temple_details_2.parquet')

Filter places to only The Church of Jesus Christ of Latter-day Saints

In [0]:
# 384 Rows (exact match), 406 Rows (levenshtein, LDS matches, and excluding temples)
# My original code, replaced by Hathaway's
my_churches = places.filter(
    (places.location_name == 'LDS Church') | 
    (places.location_name == 'Church of Jesus Christ LDS') |
    (F.levenshtein(F.lit("The Church of Jesus Christ of Latter day Saints"), F.col("location_name")) < 10)
)
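
For intuition about the `F.levenshtein(...) < 10` threshold in the cell above, here is a minimal pure-Python sketch of edit distance. It is not Spark's implementation; the helper name `levenshtein` and the sample names are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


TARGET = "The Church of Jesus Christ of Latter day Saints"

# Hypothetical location names, for illustration only
for name in ["Church of Jesus Christ of Latter-day Saints", "LDS Church"]:
    print(name, levenshtein(TARGET, name), levenshtein(TARGET, name) < 10)
```

A name that only drops the leading "The " and adds a hyphen is within the threshold, while a short alias such as "LDS Church" is far more than 10 edits away. That is presumably why the short names are matched with separate equality conditions.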

In [0]:
idaho_temples = places.filter(
    ((places.street_address == '750 S 2nd E') & (places.city == 'Rexburg')) | 
    ((places.street_address == '1000 Memorial Dr') & (places.city == 'Idaho Falls')) | 
    ((places.street_address == '3100 Butte St') & (places.city == 'Pocatello')) | 
    ((places.street_address == '1405 Eastland Dr N') & (places.city == 'Twin Falls')) | 
    ((places.street_address == '1211 S Cole Rd') & (places.city == 'Boise')) |
    ((places.street_address == '7355 N Linder Rd') & (places.city == 'Meridian'))
)

In [0]:
hathaway_churches = places.filter(
    (F.col("top_category") == "Religious Organizations") &
    (F.col("location_name").rlike(r"Latter|latter|Saints|saints|LDS|\b[Ww]ard\b")) &  # raw string so \b is a word boundary, not a backspace
    (F.col("location_name").rlike("^((?!Reorganized).)*$")) &
    (F.col("location_name").rlike("^((?!All Saints).)*$")) &
    (F.col("location_name").rlike("^((?![cC]ath).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]ody).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]eter).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]atholic).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]res).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]inist).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]ission).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]orship).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]ain).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]aptist).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]eth).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]vang).)*$")) &
    (F.col("location_name").rlike("^((?![Ll]utheran).)*$")) &
    (F.col("location_name").rlike("^((?![Oo]rthodox).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]piscopal).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]abernacle).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]arvest).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ssem).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]edia).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]artha).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]hristian).)*$")) &
    (F.col("location_name").rlike("^((?![Uu]nited).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]ellowship).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]esl).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]osmas).)*$")) &
    (F.col("location_name").rlike("^((?![Gg]reater).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]rison).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]ommuni).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]lement).)*$")) &
    (F.col("location_name").rlike("^((?![Vv]iridian).)*$")) &
    (F.col("location_name").rlike("^((?![Dd]iocese).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]istory).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]chool).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]hought).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]oliness).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]artyr).)*$")) &
    (F.col("location_name").rlike("^((?![Jj]ames).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]ellowship).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]ouse).)*$")) &
    (F.col("location_name").rlike("^((?![Gg]lory).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]nglican).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]oetic).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]anctuary).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]quipping).)*$")) &
    (F.col("location_name").rlike("^((?![Jj]ohn).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ndrew).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]manuel).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]edeemed).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]erfecting).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]ngel).)*$")) &
    (F.col("location_name").rlike("^((?![Aa]rchangel).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]icheal).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]hought).)*$")) &
    (F.col("location_name").rlike("^((?![Pp]ariosse).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]osmas).)*$")) &
    (F.col("location_name").rlike("^((?![Dd]eliverance).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]ociete).)*$")) &
    (F.col("location_name").rlike("^((?![Tt]emple).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]eminary).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]mployment).)*$")) &
    (F.col("location_name").rlike("^((?![Ii]nstitute).)*$")) &
    (F.col("location_name").rlike("^((?![Cc]amp).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]tudent).)*$")) &
    (F.col("location_name").rlike("^((?![Ee]ducation).)*$")) &
    (F.col("location_name").rlike("^((?![Ss]ocial).)*$")) &
    (F.col("location_name").rlike("^((?![Ww]elfare).)*$")) &
    (F.col("location_name").rlike("^((?![Cc][Ee][Ss]).)*$")) &
    (F.col("location_name").rlike("^((?![Ff]amily).)*$")) &
    (F.col("location_name").rlike("^((?![Mm]ary).)*$")) &
    (F.col("location_name").rlike("^((?![Rr]ussian).)*$")) &
    (F.col("location_name").rlike("^((?![Bb]eautif).)*$")) &
    (F.col("location_name").rlike("^((?![Hh]eaven).)*$")) &    
    (F.col("location_name").rlike("^((?!Inc).)*$")) &
    (F.col("location_name").rlike("^((?!God).)*$"))
  )
  
hathaway_churches.display()

placekey,poi_cbg,parent_placekey,location_name,brands,safegraph_brand_ids,store_id,top_category,sub_category,naics_code,open_hours,category_tags,latitude,longitude,street_address,city,region,postal_code,iso_country_code,opened_on,closed_on,tracking_closed_since,websites,phone_number,wkt_area_sq_meters
zzy-222@5wj-krc-dsq,160499400001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,45.927403,-116.11182,Valley Vw,Kamiah,ID,83536,US,,,2019-07,[],12089352842.0,1362.0
zzw-222@5wq-s83-g8v,160679702001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.557902,-113.792354,241 N Overland Ave,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086780434.0,954.0
222-222@5wf-zyt-rhq,160439703001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.967664,-111.680578,145 E 1st N,St. Anthony,ID,83445,US,,,2019-07,"[""lds.org""]",12086247540.0,1281.0
223-223@5w9-hwd-5mk,160010102252.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.695074,-116.345195,700 E State St,Eagle,ID,83616,US,,,2019-07,[],12089392988.0,522.0
zzy-222@5ws-j8j-6x5,160419701002.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.100707,-111.871853,213 N 2nd E,Preston,ID,83263,US,,,2019-07,"[""lds.org""]",12088521469.0,60.0
zzy-222@5wq-s89-9cq,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.537956,-113.794962,213 W Main St,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086542591.0,923.0
222-222@5ws-hhd-zpv,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.230853,-111.95055,7389 N 3000 W,Preston,ID,83263,US,,,2019-07,"[""lds.org""]",12088522768.0,218.0
zzw-222@5wq-s83-g8v,,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.557902,-113.792354,241 N Overland Ave,Burley,ID,83318,US,,,2019-07,"[""lds.org""]",12086780434.0,954.0
zzy-222@5wr-7dm-9xq,160830009001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,42.586804,-114.443502,2085 S Temple Dr,Twin Falls,ID,83301,US,,,2019-07,"[""mormon.org""]",12087333446.0,2467.0
zzy-222@5w9-jbw-p7q,160010001001.0,,The Church of Jesus Christ of Latter day Saints,,,,Religious Organizations,Religious Organizations,813110,,Churches,43.619514,-116.20184,Shamrock & Mcmillanrd,Boise,ID,83702,US,,,2019-07,"[""lds.org""]",12083773220.0,1210.0
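
The long chain of `^((?!term).)*$` patterns in the filter above expresses "the name does not contain term". A compact way to see this is the sketch below, which uses Python's `re` module instead of Spark's `rlike`; it assumes the two regex engines treat these particular constructs the same way. The function name, the shortened exclusion list, and the sample names are all hypothetical.

```python
import re

# Raw strings keep \b as a regex word boundary; in a plain Python
# string literal, "\b" is a backspace character.
INCLUDE = re.compile(r"Latter|latter|Saints|saints|LDS|\b[Ww]ard\b")
# Illustrative subset of the exclusion terms, merged into one alternation
EXCLUDE = re.compile(r"Reorganized|All Saints|[Cc]ath|[Tt]emple|[Ss]eminary")


def looks_like_meetinghouse(name: str) -> bool:
    """True if the name matches an include term and no exclude term."""
    return bool(INCLUDE.search(name)) and not EXCLUDE.search(name)


def lacks_term(name: str, term: str) -> bool:
    """Negative-lookahead form used in the notebook's filter."""
    return bool(re.search(rf"^((?!{term}).)*$", name))


# Hypothetical location names
samples = [
    "The Church of Jesus Christ of Latter day Saints",
    "All Saints Episcopal Church",
    "LDS Temple",
    "Rexburg 5th Ward",
    "Edwards Chapel",
]
for name in samples:
    print(name, looks_like_meetinghouse(name))
```

For single-line names, `lacks_term(name, "Reorganized")` gives the same answer as `"Reorganized" not in name`, so the exclusions could also be written as one alternation that is negated once.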


In [0]:
# Exclude the temples
churches = hathaway_churches.join(idaho_temples, hathaway_churches.placekey == idaho_temples.placekey, how='left_anti')

### 3. Diagram of tables and columns used to build the feature

In [0]:
# Diagram of Tables and Columns
print("Church Schema: ")
churches.select(
    'placekey',
    'location_name',
).printSchema()
print()
print("Patterns Schema: ")
patterns.select(
    'placekey',
    'date_range_start',
    'raw_visit_counts',
    'popularity_by_day',
    'visitor_home_aggregation',
    'normalized_visits_by_state_scaling',
).printSchema()

Church Schema: 
root
 |-- placekey: string (nullable = true)
 |-- location_name: string (nullable = true)


Patterns Schema: 
root
 |-- placekey: string (nullable = true)
 |-- date_range_start: string (nullable = true)
 |-- raw_visit_counts: double (nullable = true)
 |-- popularity_by_day: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- visitor_home_aggregation: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
 |-- normalized_visits_by_state_scaling: double (nullable = true)



### 4. Code Snippet of Data Wrangling

In [0]:
""" 
Assumptions: 
A place's Sunday share of weekly visits is the same for every home tract. 
The state scaling ratio is accurate. 
"""
# Filter to just churches
church_patterns = patterns.join(churches, on='placekey', how='leftsemi')

# visitor_home_aggregation
home_agg = church_patterns.select(
    "*", 
    # Explodes the map of census tracts and visitor counts
    F.explode(
        F.col('visitor_home_aggregation')
        ).alias('tract', 'tract_visitors'),    
    # Gets the state scaling ratio
    (F.col('normalized_visits_by_state_scaling')/F.col('raw_visit_counts')).alias('state_scaling'),
    # Multiplies tract visitors by the state scaling ratio to get a more accurate total estimate
    (F.col('tract_visitors') * F.col('state_scaling')).alias('tract_visitors_scaled')
)

home_agg = home_agg.select(
    "*",
    # Gets the share of visits that fall on Sunday (Sunday visits / visits on all seven days)
    (F.expr(
        """popularity_by_day['Sunday'] / 
        (popularity_by_day['Monday'] 
        + popularity_by_day['Tuesday']
        + popularity_by_day['Wednesday']
        + popularity_by_day['Thursday']
        + popularity_by_day['Friday']
        + popularity_by_day['Saturday']
        + popularity_by_day['Sunday']
           )""")).alias('sunday_ratio'),
    # Calculates the Sunday visits by census tract
    (F.col('tract_visitors_scaled')*F.col('sunday_ratio')).alias('sunday_visitors')
    
)

home_agg.select('tract', 
                F.month(home_agg['date_range_start']).alias('month'),
                'tract_visitors', 
                'state_scaling', 
                'tract_visitors_scaled', 
                'sunday_ratio', 
                'sunday_visitors').display()


tract,month,tract_visitors,state_scaling,tract_visitors_scaled,sunday_ratio,sunday_visitors
16001010331,12,32,16.110201850071867,515.5264592022997,0.6436170212765957,331.8016040610546
16001010313,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16001010225,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16001010334,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16027021903,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16015950200,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16001010335,12,4,16.110201850071867,64.44080740028747,0.6436170212765957,41.47520050763183
16065950301,7,4,16.40532143209921,65.62128572839684,0.75,49.21596429629763
16031950600,4,5,15.477198214149562,77.3859910707478,0.0,0.0
16005000400,4,39,15.477198214149562,603.6107303518329,0.4855072463768116,293.05738357661454
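
As a quick consistency check, the arithmetic in a few of the rows displayed above can be reproduced in plain Python. The values are copied from the output table; the helper name `row_is_consistent` and the tolerance are assumptions made for this sketch.

```python
import math

# (tract_visitors, state_scaling, tract_visitors_scaled, sunday_ratio, sunday_visitors)
rows = [
    (32, 16.110201850071867, 515.5264592022997, 0.6436170212765957, 331.8016040610546),
    (4, 16.40532143209921, 65.62128572839684, 0.75, 49.21596429629763),
    (39, 15.477198214149562, 603.6107303518329, 0.4855072463768116, 293.05738357661454),
]


def row_is_consistent(row, rel_tol=1e-9):
    """Check scaled = visitors * scaling and sunday = scaled * sunday_ratio."""
    visitors, scaling, scaled, sunday_ratio, sunday_visitors = row
    return (
        math.isclose(visitors * scaling, scaled, rel_tol=rel_tol)
        and math.isclose(scaled * sunday_ratio, sunday_visitors, rel_tol=rel_tol)
    )


all_consistent = all(row_is_consistent(row) for row in rows)
print(all_consistent)
```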


In [0]:
# Sanity check for one place and month: the sum of scaled tract visitors should be close to the scaled visitor total.
home_agg.filter(
    'placekey=="222-222@5w9-pn2-kxq" AND date_range_start=="2019-11-01T00:00:00-06:00"'
    ).groupBy(
        'placekey'
        ).agg(
            (F.sum('tract_visitors_scaled')).alias('Tract Sum'), 
            (F.median('raw_visitor_counts')*F.median('state_scaling')).alias('Visitor Total')
            ).display()

placekey,Tract Sum,Visitor Total


In [0]:
# Calculates the Sunday visits by tract
final_df = home_agg.filter(
    F.col('tract').startswith('16') # 16 is Idaho's state FIPS code, the prefix of every Idaho tract GEOID
    
).select(
    "*",
    F.month(home_agg['date_range_start']).alias('month')

).groupBy(
    'tract', 
    'month',

).agg(
    F.sum('sunday_visitors').alias('sum_sunday_visitors'),
    F.median('distance_from_home').alias('median_distance_from_home')
)

final_df.display()

tract,month,sum_sunday_visitors,median_distance_from_home
16001000100,2,58.98440599060254,2549.0
16001000100,3,45.22172105879126,4738.0
16001000100,4,30.924086125798947,4376.0
16001000100,5,61.512978311543336,5619.0
16001000100,6,13.910846523960103,10558.5
16001000100,7,22.579367132351603,8817.0
16001000100,8,19.09624759908985,6948.0
16001000100,9,17.658216653130143,3526.0
16001000100,10,31.18245652723247,6440.0
16001000100,11,25.917201911732157,6291.0
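
The grouping step behind this table can be sketched without Spark. The example below keeps only tracts whose GEOID starts with Idaho's state FIPS code, then sums Sunday visitors and takes the median distance per tract and month. All rows are hypothetical, and the function name `aggregate` is an assumption made for this sketch.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (tract, month, sunday_visitors, distance_from_home)
rows = [
    ("16001000100", 2, 40.0, 2000.0),
    ("16001000100", 2, 18.5, 3000.0),
    ("16001000100", 3, 45.0, 4738.0),
    ("49005000100", 2, 12.0, 90000.0),  # out-of-state tract, dropped
]


def aggregate(rows, state_fips="16"):
    """Sum Sunday visitors and take the median distance per (tract, month)."""
    groups = defaultdict(lambda: {"visitors": [], "distances": []})
    for tract, month, visitors, distance in rows:
        if not tract.startswith(state_fips):
            continue  # keep only tracts in the chosen state
        groups[(tract, month)]["visitors"].append(visitors)
        groups[(tract, month)]["distances"].append(distance)
    return {
        key: (sum(group["visitors"]), median(group["distances"]))
        for key, group in sorted(groups.items())
    }


result = aggregate(rows)
print(result)
```

Grouping on month alone, as the cell above does, would merge the same month from different years if the data spanned more than one year.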


In [0]:
# Calculates the Sunday visits by place (for verification purposes)
home_agg_churches = home_agg.filter(
    F.col('tract').startswith('16') # 16 is Idaho's state FIPS code, the prefix of every Idaho tract GEOID
    
    ).join(churches, ['placekey'], how='inner') # Join with churches

    
by_place = home_agg_churches.select(
    "*",
    F.month(home_agg_churches['date_range_start']).alias('month')
).groupBy(
    'placekey',
    'month'

).agg(
    F.sum('sunday_visitors').alias('sum_sunday_visitors'),
)

graph_df = home_agg_churches.select(
    "*",
).groupBy(
    'tract', 
).agg(
    (F.sum('sunday_visitors')/11).alias('sum_sunday_visitors'), # Monthly average: the SafeGraph data covers 11 months
)



In [0]:
# Validate number of census tracts in Idaho, should be 298
# visits_by_tract.filter(F.col('tract').startswith('16')).count()
home_agg_churches.select(F.col('tract')).filter(F.col('tract').startswith('16')).distinct().count()

Out[46]: 289
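
The count above is 289 rather than the expected 298. A gap like this can occur when some tracts have no visitors recorded at any of the filtered churches, although that has not been checked here. One way to investigate is to list the expected tract IDs that are absent from the results. In the sketch below, the function name `missing_tracts` and both sets of tract IDs are hypothetical placeholders.

```python
def missing_tracts(expected, observed):
    """Return the sorted tract IDs that are expected but not observed."""
    return sorted(set(expected) - set(observed))


# Hypothetical IDs: in practice `expected` could come from the GEOID column
# of the tract shapefile and `observed` from the distinct tracts in the results.
expected = {"16001000100", "16001000201", "16001000202"}
observed = {"16001000100", "16001000202"}

missing = missing_tracts(expected, observed)
print(missing)
```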

In [0]:
# Megan's church
by_place.filter('placekey=="zzw-222@5w9-hmg-pgk"').display()
# 500-1000 Sunday visitors each month -- seems reasonable. 

placekey,month,sum_sunday_visitors
zzw-222@5w9-hmg-pgk,10,535.0213819041235
zzw-222@5w9-hmg-pgk,8,655.7375776110822
zzw-222@5w9-hmg-pgk,3,801.2358515964291
zzw-222@5w9-hmg-pgk,11,900.904475149124
zzw-222@5w9-hmg-pgk,9,835.5301956222046
zzw-222@5w9-hmg-pgk,12,1009.604424817226
zzw-222@5w9-hmg-pgk,4,550.5460507604629
zzw-222@5w9-hmg-pgk,2,554.6678597472927
zzw-222@5w9-hmg-pgk,5,557.769903708474
zzw-222@5w9-hmg-pgk,6,973.6112893368391


### 5. Visualizations

In [0]:
!pip install geopandas
!pip install folium



In [0]:
import geopandas as gpd
import folium

dbutils.fs.cp("dbfs:/FileStore/tl_2019_16_tract", "file:/tmp/tl_2019_16_tract", recurse = True)
gdf = gpd.read_file("/tmp/tl_2019_16_tract")
# Join graph_df and gdf
gdf = gdf.merge(graph_df.toPandas(), left_on='GEOID', right_on='tract')

In [0]:
# gdf
# Assuming gdf is your GeoPandas DataFrame containing census tracts
m = folium.Map(location=[45.697873, -114.346173], zoom_start=6)
# Convert the GeoDataFrame to a GeoJSON and add it to the folium map
gdf_json = gdf.to_json()
choropleth = folium.Choropleth(
    geo_data=gdf_json,
    data=gdf,
    columns=['tract', 'sum_sunday_visitors'],  
    key_on='feature.properties.tract', 
    fill_color='YlGn', 
    fill_opacity=0.8,
    line_opacity=0.5,
    legend_name='Sunday Visits'
)

# tooltip = folium.GeoJsonTooltip(fields=['tract', 'sum_sunday_visitors'], aliases=['Tract', 'Visitors'], localize=True)

# choropleth.add_child(tooltip)
choropleth.add_to(m)

# m


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-560169232424006>:3
      1 # gdf
      2 # Assuming gdf is your GeoPandas DataFrame containing census tracts
----> 3 m = folium.Map(location=[45.697873, -114.346173], zoom_start=6)
      4 # Convert the GeoDataFrame to a GeoJSON and add it to the folium map
      5 gdf_json = gdf.to_json()

NameError: name 'folium' is not defined

<img src = '/files/Screenshot_2023_10_31_at_8_33_14_PM.png'>
<img src ='/files/newplot__1_.png'>


### 6. Display of the first five rows of your feature table used in the visualizations

In [0]:
display(graph_df.orderBy('tract').head(5))

tract,sum_sunday_visitors
16001000100,38.46542862168799
16001000201,22.925532081082704
16001000202,66.53805417020122
16001000302,51.69955177909152
16001000303,17.566378663817087
