# Boston Crime Analysis & Map Visualizations
## Developed by Charles Karafotias
### CMSC320-0101: Introduction To Data Science Final Project

# Introduction
Crime has always been an issue in the United States of America. Whether it be gun violence, trespassing, or other felony offenses, crime rates are a major consideration when people choose where to live and when they judge it safe to leave their homes. According to Pew Research Center, crime rates in the United States have been on the decline overall (read more [here](https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/)), yet "Americans tend to believe crime is up" despite the extensive research that has been done on this topic (Gramlich). Given how much has changed since that article was published, including the ongoing effects of the COVID-19 pandemic, I am interested in analyzing the crime rates that currently exist in a major city of the United States.

Of the many crime datasets that are available to the public, I am particularly interested in the crime rates of Boston, Massachusetts, and in finding out whether there are patterns in the types of crimes that occur in different neighborhoods of the city. Provided by the city of Boston, the **Crime Incident Reports (August 2015 - To Date) (Source: New System)** dataset located [here](https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system) provides a plethora of data for analysis. I specifically chose Boston because there are 44 institutions of higher education in the metropolitan area alone and many others around the city (more information [here](https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_metropolitan_Boston)). With so many students coming and going during the college semesters, I pose the following questions that this project aims to answer:
1.  Does the crime rate increase during the college semesters (Fall and Spring semesters)?
2.  What are the most common crimes during the college semesters in the most populated college neighborhoods?
3.  How does the crime rate differ between the work week and the weekend during the college semester?
4.  During college breaks, does the city of Boston experience different frequencies of common crimes?
5.  How do the crime frequencies of different neighborhoods compare?

To answer the above questions, I walk through the data science pipeline, which consists of data collection, data processing, exploratory analysis and data visualization, analysis and hypothesis testing, and decision making based on the results. Each stage is walked through below.

## Requirements For Code
The language selected for this project is Python. I elected to use Python because many additional libraries and modules (listed below) provide valuable tools for data science. The list below covers those used throughout the project, with links to the accompanying documentation for further reading.
1. [NumPy](https://numpy.org/doc/stable/reference/)
2. [Pandas](https://pandas.pydata.org/docs/reference/index.html)
3. [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/#)
4. [os](https://docs.python.org/3/library/os.html)
5. [requests](https://docs.python-requests.org/en/latest/)
6. [shutil](https://docs.python.org/3/library/shutil.html#module-shutil)

# Data Collection
The first step of the data science pipeline is to collect the data. In this step, the datasets that are needed for the project are downloaded or gathered from their respective sources. I have developed a script that downloads the data from the Boston city website, when it is available, and organizes it into the proper format. This ensures that the data stays up to date, as the dataset is updated every year.

The script below begins by importing the required libraries and modules: requests, os, shutil, and BeautifulSoup. The requests library is used to access the Boston crime dataset website, and the BeautifulSoup library is used to parse the content returned by the requests get call. The os and shutil modules are used for simple calls such as making a directory and removing old directories.

After the imports, the script has the following flow:
1. The get() function is called with the URL of the Boston crime data, and the response it returns is stored.
2. If the website returns a 200 status code, meaning that the request succeeded, then it is safe to interact with the webpage. If the response is not 200, the script looks for old data, which would be located at './Data'. If this exists, this data is used. If the website is unavailable and no old data exists, the program raises an error that informs the user.
3. When a 200 code is returned, a BeautifulSoup object is instantiated to read the HTML returned by the request. From here, the script finds all links whose class attribute is 'btn btn-primary'. These links point to the .csv and .xlsx files that will be downloaded for analysis later in the project.
4. Once the links are found, they are stored in a list called links. If an old folder called 'Data' still exists, it is removed, and in either case a new 'Data' folder is created. Then the files are downloaded and stored in the new 'Data' folder with the proper naming. The naming scheme is c1, c2, ..., where 'c' stands for 'crime' and the number is the position of the link.
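
The branching in step 2 and the naming scheme in step 4 can be sketched as two small pure functions that are easy to check without touching the network. This is only an illustration: the function names `choose_data_source` and `local_file_name` are hypothetical and do not appear in the project script, and the example URLs are placeholders. The sketch also takes the file extension from the URL path only, which is an assumption that differs slightly from splitting the whole URL on '.'.

```python
import os
from urllib.parse import urlparse


def choose_data_source(status_code, data_dir_exists):
    """Return 'download' or 'cached' following step 2; raise if neither is possible."""
    if status_code == 200:
        return "download"
    if data_dir_exists:
        return "cached"
    raise RuntimeError("Website unavailable and no local Data directory found")


def local_file_name(index, url, data_dir="./Data"):
    """Build the c1, c2, ... file name for the link at zero-based position `index`."""
    # Take the extension from the URL path only, so a query string is ignored.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip(".")
    return f"{data_dir}/c{index + 1}.{extension}"


print(choose_data_source(200, data_dir_exists=False))  # download
print(choose_data_source(503, data_dir_exists=True))   # cached
print(local_file_name(0, "https://example.com/files/crime-2015.csv"))         # ./Data/c1.csv
print(local_file_name(1, "https://example.com/files/codes.xlsx?download=1"))  # ./Data/c2.xlsx
```

Keeping the decision separate from the download makes it possible to test the fallback behaviour with made-up status codes.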

```python
# Import the libraries needed for this section: requests fetches the HTML,
# BeautifulSoup parses it, and os/shutil manage the local Data folder
import requests
from bs4 import BeautifulSoup
import os
import shutil

r = requests.get('https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system')
if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')
    links = []
    # Find all href links on anchors with the class 'btn btn-primary'. These are
    # the downloadable .csv and .xlsx files that hold the data needed for the project
    for link in soup.find_all('a', class_='btn btn-primary', href=True):
        links.append(link['href'])

    # Remove any old Data folder so stale files are not mixed with new ones
    if os.path.isdir('./Data'):
        shutil.rmtree('./Data')

    # Make the folder, then download each file into it as c1, c2, ...
    os.mkdir('./Data')
    for index, curr_url in enumerate(links):
        r = requests.get(curr_url, allow_redirects=True)
        fileExt = curr_url.split('.')[-1]
        fileName = f'./Data/c{index+1}.{fileExt}'
        # Use a context manager so the file is closed once it has been written
        with open(fileName, 'wb') as f:
            f.write(r.content)
elif os.path.isdir('./Data'):
    # Use this data instead
    print('Using old dataset saved. Failed to retrieve updated data from website.')
else:
    raise Exception('Failed to retrieve data from website and failed to find Data directory in current directory')
```
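
To see what the link selection in step 3 does without contacting the website, the same filtering can be tried on a small hand-written HTML snippet. The sketch below is a dependency-free illustration that uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the project's method. The class name `DownloadLinkParser`, the sample HTML, and the example.com URLs are all made up for this example; only the 'btn btn-primary' class string comes from the script.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class attribute equals target_class."""

    def __init__(self, target_class="btn btn-primary"):
        super().__init__()
        self.target_class = target_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Keep the link only if the class matches exactly and an href is present.
        if attr_map.get("class") == self.target_class and attr_map.get("href"):
            self.links.append(attr_map["href"])


# Hand-written stand-in for the dataset page (hypothetical content).
SAMPLE_HTML = """
<ul>
  <li><a class="btn btn-primary" href="https://example.com/files/2015.csv">Download</a></li>
  <li><a class="btn" href="https://example.com/about">About</a></li>
  <li><a class="btn btn-primary" href="https://example.com/files/codes.xlsx">Download</a></li>
</ul>
"""

parser = DownloadLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
# ['https://example.com/files/2015.csv', 'https://example.com/files/codes.xlsx']
```

Only the two anchors carrying the full 'btn btn-primary' class are kept, which is the behaviour the download loop relies on.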

# Data Parsing & Organization 

# Exploratory Data Analysis

# Modeling Future Crime

# Map Visualizations

# References
Gramlich, John. “What the Data Says (and Doesn't Say) about Crime in the United States.” Pew Research Center, The Pew 
Charitable Trusts, 23 Nov. 2020, https://www.pewresearch.org/fact-tank/2020/11/20/facts-about-crime-in-the-u-s/. 

