**A1 part 4**:
- The temporal activity for `PROSTITUTION` shows a surprising pattern on Thursdays. Either there is a lot of prostitution going on in San Francisco on Thursdays, or there is something wrong with the data. Reports made during the week may not have been registered until that day, the data may not have been recorded properly, or the same crime may have been reported several times. Any of these would bias the data and, if unnoticed, could lead to misconceptions about the underlying weekly pattern of prostitution.
- The jitter plot shows that many more crimes were recorded on the hour, at 15 minutes past the hour, and, to a lesser extent, at whole increments of 10 minutes. Crimes appear to be recorded far less frequently between those round numbers. It's a common human habit to round the time to the nearest 5 or 10 minutes, and if the exact time can't be recalled, rounding to the nearest half or whole hour is the most likely option. This habit introduces a bias in the data. Consequently, if we were to model the daily crime rate, our predictions would most likely pick up on the rounding pattern, and we wouldn't be able to accurately predict crime occurrences throughout the day.
- The *Hall of Justice* on the 800 block of Bryant Street seems to be an unlikely hotspot, since it's right next to the penitentiary. The most likely explanation is that sex offences were registered at the penitentiary without the actual location being recorded, so the penitentiary was used as a default location. Our visualization tools pick up this bias, so if it goes unnoticed, one could easily be misled into believing that the Hall of Justice is a hotspot for sex offences.
- As shown in the heatmap produced by the following code, the most frequent reports of sex offenses (non forcible) occur at the San Francisco General Hospital on Potrero Avenue. It is extremely unlikely that the offenses themselves took place there; as with the Hall of Justice, the hotspot is more plausibly a result of reports being filed at the hospital some time after the actual incident. Consequently, one could be misled into believing that the hospital is a hotspot for sex offenses, which could have serious consequences for model predictions. For example, if we were to distribute police officers based on the crime reports, we would most likely send more officers to the hospital, which would be a waste of resources.
- LLMs were used to help understand the folium framework and to create the interactive maps as well as the heatmaps.
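
The rounding effect described for the jitter plot can also be checked numerically. The following is a minimal sketch, not the analysis used in this notebook: it assumes the report time is stored as an `HH:MM` string in a column called `Time` (an assumption about the dataset), and the helper names `minute_of_hour_counts` and `share_on_multiples` are hypothetical. The toy data is made up for illustration.

```python
import pandas as pd

def minute_of_hour_counts(df, time_col="Time"):
    """Count reports for each minute of the hour (0-59).

    Assumes `time_col` holds strings such as "12:15" (hypothetical column name).
    """
    minutes = pd.to_datetime(df[time_col], format="%H:%M").dt.minute
    # Reindex so that minutes with no reports show up as 0 instead of missing.
    return minutes.value_counts().reindex(range(60), fill_value=0)

def share_on_multiples(counts, step):
    """Fraction of all reports whose minute is a multiple of `step`."""
    total = counts.sum()
    if total == 0:
        return 0.0
    return counts[counts.index % step == 0].sum() / total

# Made-up toy data: five of the six times fall on a quarter hour.
toy = pd.DataFrame({"Time": ["12:00", "12:00", "12:15", "12:30", "12:37", "13:00"]})
counts = minute_of_hour_counts(toy)
quarter_share = share_on_multiples(counts, 15)
print(quarter_share)
```

If times were recorded without rounding, roughly 4 out of 60 minutes (about 7%) would fall on a quarter hour, so a much larger share on the real data would support the rounding explanation.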

In [2]:
import numpy as np
import pandas as pd
import folium
from folium.plugins import HeatMap
import os

In [9]:
# Locate the 'files' directory, which sits next to the current working directory
cwd = os.getcwd()
parent = os.path.dirname(cwd)
files = os.path.join(parent, 'files')
# Load the police department incident reports into a pandas DataFrame
police = pd.read_csv(os.path.join(files, 'Police_Department_Incident_Reports__Historical_2003_to_May_2018_20240130.csv'))
# Parse the Date column into datetime objects
police["Date"] = pd.to_datetime(police["Date"])


In [10]:
# Approximate coordinates of central San Francisco
lat = 37.773972
lon = -122.431297
# Create the base map centred on San Francisco
sanfran_map = folium.Map(location=[lat, lon], zoom_start=12)

In [11]:
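# Get all instances of SEX OFFENSES, NON FORCIBLE across all years
sex_crime = police[police['Category'] == 'SEX OFFENSES, NON FORCIBLE']

# HeatMap expects (latitude, longitude) pairs: Y holds the latitude, X the longitude
heat_data = list(zip(sex_crime['Y'], sex_crime['X']))
HeatMap(heat_data).add_to(sanfran_map)
# Display the map with the heatmap layer
sanfran_map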
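
To back up what the heatmap suggests visually, the reports can be counted per coordinate. The sketch below is one possible way to do this, under stated assumptions: it groups on the `Y` and `X` columns rounded to four decimals (an arbitrary choice), the helper name `top_locations` is hypothetical, and the toy coordinates are made up rather than taken from the dataset.

```python
import pandas as pd

def top_locations(df, n=5, decimals=4):
    """Return the n most frequent (Y, X) coordinates with their report counts."""
    coords = df[["Y", "X"]].dropna().round(decimals)
    counts = coords.groupby(["Y", "X"]).size()
    # Sort by count, highest first, and turn the result into a tidy DataFrame.
    return counts.sort_values(ascending=False).head(n).reset_index(name="reports")

# Made-up toy coordinates: one location is reported three times.
toy = pd.DataFrame({
    "Y": [37.7553, 37.7553, 37.7553, 37.7751, 37.7751, 37.8000],
    "X": [-122.4047, -122.4047, -122.4047, -122.4037, -122.4037, -122.4100],
})
top = top_locations(toy, n=2)
print(top)
```

Applied to the filtered frame, for example `top_locations(sex_crime)`, a single coordinate with far more reports than any other would point to a default or reporting location rather than a true hotspot.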