## New York City Crime Statistics - Web-Scraping Project

This code was written in **Python 3.8.13** and executed on an M1 Mac. \
Depending on the operating system, a different ChromeDriver build may be needed for the code to run successfully.

<img src="Images/Compstat 2.0 Example.png"></img>

In the image above, we can see the Compstat 2.0 website. On the left of the website, the **CompStat Book** 
section controls which data points are shown on the **Incident Map** in the middle. (In the 
image above, I selected the **<u>Total</u>** number of crimes that happened in the **<u>Week to Date</u>** 
**<u>2022</u>**, from 06/13/2022 to 06/19/2022.) Each circle carries x and y coordinates relative to the map 
(the SVG `cx` and `cy` attributes), a specific and a broad crime label, and the time the crime occurred. Each circle on the 
map has to be clicked individually to extract this data from the HTML code of the website.

### Objective
- first, specify a category and time frame of crimes that should be displayed on the map
- loop through all circles on the map to get the time, location, and specific/broad crime label

### Implementation

In [None]:
!pip install selenium tqdm pandas webdriver-manager

- importing all relevant packages for scraping the Compstat 2.0 website

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

import pandas as pd
import time as tm
import re

from tqdm import tqdm
import logging as log

- The following function sets up a Chrome instance to use for the website interactions

In [4]:
def settingUpChrome():
    """Creates an instance of the chrome browser and opens up the Compstat2.0 website"""
    #option = webdriver.ChromeOptions()
    #option.add_argument('headless')
    global driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get("https://compstat.nypdonline.org/")
    driver.maximize_window()

- This function finds the relevant elements on the Compstat 2.0 website and assigns them to global variables so that they can be interacted with.
- It also creates a dictionary that stores the elements used to select the different data categories.

In [5]:
def loadElementsOnSite():
    """Loads all necessary elements on the site that will be interacted with and assigns them to variables"""
    global wait
    wait = WebDriverWait(driver, 20)
    global wait_for_element
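    def wait_for_element(locator_type: str, locator: str):
        # Resolve a string such as "By.XPATH" to the By attribute without using eval()
        by = getattr(By, locator_type.split(".")[-1])
        return wait.until(EC.presence_of_element_located((by, locator)))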

    global search_category_dict 
    search_category_dict = {}

    global general_element_dictionary
    general_element_dictionary = {}

    try:
        # Totals
        global total_week_to_date_2022
        total_week_to_date_2022 = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[1]/div/div')

        search_category_dict["total_week_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[2]/div/div')

        search_category_dict["total_28_day_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[4]/div/div')

        search_category_dict["total_28_day_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[5]/div/div')

        search_category_dict["total_year_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[7]/div/div')

        search_category_dict["total_year_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[9]/td[8]/div/div')

        # Patrols
        search_category_dict["patrol_week_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[1]/div/div')

        search_category_dict["patrol_week_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[2]/div/div')

        search_category_dict["patrol_28_day_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[4]/div/div')

        search_category_dict["patrol_28_day_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[5]/div/div')

        search_category_dict["patrol_year_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[7]/div/div')
        
        search_category_dict["patrol_year_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[10]/td[8]/div/div')

        # Transits
        search_category_dict["transit_week_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[1]/div/div')

        search_category_dict["transit_week_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[2]/div/div')

        search_category_dict["transit_28_day_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[4]/div/div')

        search_category_dict["transit_28_day_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[5]/div/div')

        search_category_dict["transit_year_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[7]/div/div')
        
        search_category_dict["transit_year_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[11]/td[8]/div/div')

        # Housings
        search_category_dict["housing_week_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[1]/div/div')

        search_category_dict["housing_week_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[2]/div/div')

        search_category_dict["housing_28_day_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[4]/div/div')

        search_category_dict["housing_28_day_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[5]/div/div')

        search_category_dict["housing_year_to_date_2022"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[7]/div/div')
        
        search_category_dict["housing_year_to_date_2021"] = wait_for_element("By.XPATH",\
             '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]/table/tbody/tr[12]/td[8]/div/div')


        global wait_for_cicles_to_appear
        wait_for_cicles_to_appear = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container > div:nth-child(4) > svg > circle:nth-child(2)')))

        global first_circle
        first_circle = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container > div:nth-child(4) > svg')))                                                     

        global zoom_out
        zoom_out = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.k-map-controls.k-pos-top.k-pos-left > div > button.k-button.k-zoom-out')))

        global zoom_in
        zoom_in = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.k-map-controls.k-pos-top.k-pos-left > div > button.k-button.k-zoom-in')))

        global incident_map_logo
        incident_map_logo = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-header > h4')))
        
        global map_mover
        map_mover = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container'))) 
        
        global toggle_large_map
        toggle_large_map = wait.until(
            EC.presence_of_element_located((By.XPATH,\
                 '//*[@id="report-win13"]/div[2]/a/span')))

        global map_element
        map_element = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR,\
                 '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map')))

    finally:
        log.info("loading of Elements has finished")
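
The 24 category XPaths in `loadElementsOnSite` differ only in their table row (`tr[9]` to `tr[12]`) and column (`td[1]` to `td[8]`). As a minimal sketch, assuming the table layout stays as in the locators above, the lookup could be generated instead of typed out. The names `BASE_XPATH`, `ROWS`, `COLUMNS`, and `build_category_xpaths` are hypothetical and not part of the original code.

```python
# Hypothetical helper: derives the category XPaths from row/column indices.
BASE_XPATH = (
    '//*[@id="report-win1"]/div[3]/div[2]/div/div/div[3]/div[1]'
    "/table/tbody/tr[{row}]/td[{col}]/div/div"
)

# Row index of each crime category in the CompStat Book table
ROWS = {"total": 9, "patrol": 10, "transit": 11, "housing": 12}

# Column index of each time frame (columns 3 and 6 are not used by the scraper)
COLUMNS = {
    "week_to_date_2022": 1,
    "week_to_date_2021": 2,
    "28_day_2022": 4,
    "28_day_2021": 5,
    "year_to_date_2022": 7,
    "year_to_date_2021": 8,
}


def build_category_xpaths() -> dict:
    """Return a mapping such as 'patrol_28_day_2021' -> XPath string."""
    return {
        f"{category}_{period}": BASE_XPATH.format(row=row, col=col)
        for category, row in ROWS.items()
        for period, col in COLUMNS.items()
    }


xpaths = build_category_xpaths()
# Each value could then be passed to wait_for_element("By.XPATH", xpath)
print(len(xpaths))
print(xpaths["patrol_28_day_2021"])
```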

This function moves the SVG map horizontally or vertically

In [6]:
def moveOnMap(direction: str, scalar):
    """allows horizontal and vertical movement on the map; magnitude of movement is controlled by the scalar"""
    map_element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map')))

    if direction == "moveUp":
        webdriver.ActionChains(driver).move_to_element(map_element).click_and_hold().move_by_offset(0,1*scalar).release().perform() 
    elif direction == "moveDown":
        webdriver.ActionChains(driver).move_to_element(map_element).click_and_hold().move_by_offset(0,-1*scalar).release().perform()
    elif direction == "moveLeft":    
        webdriver.ActionChains(driver).move_to_element(map_element).click_and_hold().move_by_offset(1*scalar,0).release().perform() 
    elif direction == "moveRight":    
        webdriver.ActionChains(driver).move_to_element(map_element).click_and_hold().move_by_offset(-1*scalar,0).release().perform()
    else:
        print("please specify a direction that can be interpreted")    
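
The four branches of `moveOnMap` differ only in the drag offset. Below is a minimal sketch of the same mapping as a lookup table, which also makes the sign convention explicit: the map itself is dragged, so moving the view to the right means dragging the map to the left. `DRAG_DIRECTIONS` and `drag_offset` are hypothetical names, and raising `ValueError` instead of printing a message is a choice made for this sketch only.

```python
# Unit drag offsets (dx, dy) per direction, mirroring the branches of moveOnMap
DRAG_DIRECTIONS = {
    "moveUp": (0, 1),
    "moveDown": (0, -1),
    "moveLeft": (1, 0),
    "moveRight": (-1, 0),
}


def drag_offset(direction: str, scalar: int) -> tuple:
    """Return the (dx, dy) pixel offset for a drag in the given direction."""
    try:
        dx, dy = DRAG_DIRECTIONS[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}") from None
    return dx * scalar, dy * scalar


# The result could be unpacked into ActionChains(...).move_by_offset(*offset)
offset = drag_offset("moveRight", 60)
print(offset)
```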

This function does a number of things:
- enlarges the SVG map to increase the viewing field of circles on the map
- zooms out a bit to make sure all circles are within the visible frame of the map (and hence clickable)
- uses JavaScript to set the `.style.visibility` of every circle to `hidden`
- then iterates through the circles, setting each one's visibility to `visible` in turn and clicking it on the map
    - 1. We do it this way because the data we are looking for only appears in the HTML for a single circle (the one that has been clicked)
    - 2. We make every circle invisible so that there is no other circle that overlaps the circle that we intend to click
- after a circle has been clicked and the data has been extracted, we set the visibility back to `hidden` and move to the next circle
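
The hide-then-reveal step can be sketched as a single script, so that one `driver.execute_script` call hides all circles and shows only the target. This is an illustrative variant and not the notebook's implementation; `show_only_circle_js` is a hypothetical helper that only builds the JavaScript string.

```python
def show_only_circle_js(index: int) -> str:
    """Return JavaScript that hides every <circle> and then reveals one of them."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return (
        "var circles = document.getElementsByTagName('circle');"
        "for (var i = 0; i < circles.length; i++) {"
        "  circles[i].style.visibility = 'hidden';"
        "}"
        f"circles[{index}].style.visibility = 'visible';"
    )


# With a live Selenium session this could be run as:
# driver.execute_script(show_only_circle_js(3))
script = show_only_circle_js(3)
print(script)
```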


In [7]:
def selectDataOfChategory(verbose: bool):
    """
    Scrapes every crime shown on the map for one category of the CompStat Book.
    The category is currently fixed to `total_week_to_date_2022`.

    Parameters:
    verbose (bool): if True, prints intermediate debugging output for every circle

    Returns:
    None: the results are stored in the global DataFrame `data_frame` with the columns x_cord, y_cord, time, crime (specific crime label), and crime_category (broad crime label)
    """
    def map_loading_disapear():
        loading_disapear = wait.until(
            EC.invisibility_of_element_located((By.CSS_SELECTOR,\
                '#report-win13 > div.k-loading-mask > div.k-loading-image')))


    def make_all_disapear():
        circles = wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR,\
                    '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container > div:nth-child(4) > svg > circle')))
        
        driver.execute_script("var all_circles = document.getElementsByTagName('circle');\
                            for (i = 0; i < all_circles.length; i++){all_circles[i].style.visibility = 'hidden'};")

        incident_map_logo.click()

    try:
        ########
        
        total_week_to_date_2022.click()

        ########


        map_loading_disapear()
        toggle_large_map.click()
        tm.sleep(0.2)
        zoom_out.click()
        moveOnMap("moveRight", 60)
        webdriver.ActionChains(driver).move_to_element(first_circle).click(first_circle).perform()
        #moveOnMap("moveRight", 70)

        circles = wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container > div:nth-child(4) > svg > circle'))
        )                                                          

        print("number of circles: ", len(circles)) if verbose else None

        global data_frame
        data_frame = pd.DataFrame(columns=["x_cord","y_cord","time","crime","crime_category"])
        loop_crime_counter = 0
        text_crime_counter = 0

        
        for i in tqdm(range(len(circles))):

            make_all_disapear()

            print("circle number: ", i) if verbose else None
            
            driver.execute_script(f"var all_circles = document.getElementsByTagName('circle');\
                                    all_circles[{i}].style.visibility = 'visible';")

            given_circle = wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, f'#report-win13 > div.window-content > div.full-height > div > div.full-height.k-map > div.km-widget.km-scroll-wrapper > div.km-scroll-container > div:nth-child(4) > svg > circle:nth-child({i+2})'))
            )

            webdriver.ActionChains(driver).move_to_element(given_circle).click(given_circle).perform()

            y_coordinate = given_circle.get_attribute("cy")
            x_coordinate = given_circle.get_attribute("cx")

            information_box = wait.until(
                EC.presence_of_element_located((By.ID, "map-popup-13")))

            print() if verbose else None
            print() if verbose else None
            print("------COORDINATES------") if verbose else None
            print(i, given_circle.get_attribute("outerHTML")) if verbose else None
            print() if verbose else None
            print("y_coordinate: ", y_coordinate) if verbose else None
            print("x_coordinate: ", x_coordinate) if verbose else None

            try:
                #if there is a h6 header in the informationbox then there are multiple crimes at the same location
                h6_header = information_box.find_element(By.TAG_NAME, "h6")

            except Exception:
                h6_header = False

            if h6_header:
                n_of_crimes = int(re.findall(r'-?\d+\.?\d*', h6_header.get_attribute("innerHTML"))[0])
                text_crime_counter += n_of_crimes

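                # The leading "." keeps the XPath search inside information_box; "//" alone searches the whole page
                multi_crime_element = information_box.find_elements(By.XPATH, './/div[@style="border-bottom: 1px solid gray;"]')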
                
                for j, element in enumerate(multi_crime_element):
                    loop_crime_counter += 1
                    print(element.get_attribute("innerHTML")) if verbose else None
                    t = 0
                    tm.sleep(0.1)
                    
                    crime_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,f'#map-popup-13 > div > div:nth-child({j+2}) > div:nth-child({t+1}) > span')))
                    
                    
                    if len(crime_label_HTML.get_attribute("title"))>0:
                        crime_label = crime_label_HTML.get_attribute("title") 
                    else:
                        crime_label = crime_label_HTML.text
            
                    print(crime_label) if verbose else None

                    crime_category_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#map-popup-13 > div > div:nth-child(1) > h5')))
                    crime_category_label = crime_category_label_HTML.text
                    
                    date_time_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,f"#map-popup-13 > div > div:nth-child({j+2}) > div:nth-child({t+2}) > span")))
                    date_time_label = date_time_label_HTML.text

                    data_frame.loc[loop_crime_counter] = pd.Series({"x_cord":x_coordinate,"y_cord":y_coordinate,"time":date_time_label,"crime":crime_label, "crime_category":crime_category_label})
                    print(loop_crime_counter) if verbose else None

            else:
                print("single crime: ----> ") if verbose else None
                loop_crime_counter += 1
                text_crime_counter += 1

                tm.sleep(0.1)
                crime_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#map-popup-13 > div > div:nth-child(2) > div:nth-child(1) > span')))

                print("thats what I'm looking for: ", crime_label_HTML.text) if verbose else None


                if len(crime_label_HTML.get_attribute("title"))>0:
                    crime_label = crime_label_HTML.get_attribute("title") 
                else:
                    crime_label = crime_label_HTML.text

                crime_category_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#map-popup-13 > div > div:nth-child(1) > h5')))
                crime_category_label = crime_category_label_HTML.text

                print(crime_label) if verbose else None

                date_time_label_HTML = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,'#map-popup-13 > div > div:nth-child(2) > div:nth-child(2) > span')))
                date_time_label = date_time_label_HTML.text

                data_frame.loc[loop_crime_counter] = pd.Series({"x_cord":x_coordinate,"y_cord":y_coordinate,"time":date_time_label,"crime":crime_label, "crime_category":crime_category_label})


                log.info("loop_crime_counter: %s, text_crime_counter: %s", loop_crime_counter, text_crime_counter)
                print("loop_crime_counter: ", loop_crime_counter) if verbose else None
                print("text_crime_counter: ", text_crime_counter) if verbose else None
                
            print() if verbose else None
            print() if verbose else None
            print("------ATTRIBUTES------") if verbose else None
            print(information_box.get_attribute("innerHTML")) if verbose else None
            print() if verbose else None
            print() if verbose else None

    finally:
        print("complete")
        driver.quit()
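
Inside the loop, the number of crimes at a shared location is read from the `h6` header with a regular expression. Below is a minimal sketch of that step in isolation. `count_from_header` is a hypothetical helper, and the sample header text is invented for illustration because the real wording of the popup is not shown here. The sketch matches whole digits only, since a pattern that also accepts decimals would make `int()` fail on a match such as `3.0`.

```python
import re
from typing import Optional


def count_from_header(header_html: str) -> Optional[int]:
    """Return the first integer found in the header text, or None if there is none."""
    match = re.search(r"\d+", header_html)
    return int(match.group()) if match else None


# Invented sample strings; the real header text may differ
print(count_from_header("3 incidents at this location"))
print(count_from_header("no number here"))
```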

Executing all three functions together

In [None]:
settingUpChrome()
loadElementsOnSite()
selectDataOfChategory(verbose=False)

In [None]:
data_frame.head(40)

In [None]:
data_frame.to_csv('scraped-data/total_week_to_date_2022.csv')
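
`DataFrame.to_csv` does not create missing directories, so the export above fails if `scraped-data/` does not exist yet. Below is a minimal sketch that prepares the target path first; `output_path` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def output_path(directory: str, category: str) -> Path:
    """Create the output directory if needed and return the CSV path for a category."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{category}.csv"


# Usage with the scraped results (data_frame comes from the cells above):
# data_frame.to_csv(output_path("scraped-data", "total_week_to_date_2022"))
```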

### In Planning

- database implementation
- parallel execution via multiple tabs to increase scraping speed
- data exploration
- data visualization
- machine-learning application