# Crimes in Boston Analysis 

Project based on the Kaggle dataset: https://www.kaggle.com/datasets/AnalyzeBoston/crimes-in-boston

In [2]:
import sqlite3

In [3]:
import numpy as np
import pandas as pd

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
import statsmodels.formula.api as smf

# Analysis for the mayor's team

During the last municipal campaign in Boston, crime was a major topic of debate. Because citizens have expressed strong expectations of her on that front, the newly elected mayor of Boston is looking for data-based insights on crime in the Massachusetts capital. She has commissioned your economics and urbanism consulting firm, The Locomotive, to carry out this study.

## Load Database

In [13]:
# Download database
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/boston_crimes.sqlite > db/boston_crimes.sqlite

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.1M  100 19.1M    0     0  11.2M      0  0:00:01  0:00:01 --:--:-- 11.2M


## Explore the database

Three tables are available:

- the incident_reports table was provided by the Police Department of Boston. Each observation corresponds to a criminal incident that required a police intervention in the municipality of Boston;
- the districts table was provided by the Urbanism Department of Boston. It gathers geographical information about the various police districts of Boston;
- the indicators table was shared by the Economics Department of Boston, which keeps track of various indicators of the social and economic activity of Boston neighborhoods. Each observation corresponds to a police district.

**==> I'm using DBeaver to analyse these three tables**

**==> A schema of the database is available in the db folder**
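For readers without DBeaver, the same exploration can be sketched directly with `sqlite3`. The helper name `describe_tables` and the simplified demo columns below are assumptions made for illustration; on the real database, the function would be called on a connection to `db/boston_crimes.sqlite`.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 holds the column name
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Demo on a throwaway in-memory database (hypothetical, simplified columns)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE districts (CODE TEXT, NAME TEXT)")
demo.execute("CREATE TABLE indicators (CODE TEXT, MEDIAN_AGE REAL)")
demo_schema = describe_tables(demo)
print(demo_schema)
```
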

## Extract dataset

We want to investigate the influence of the socio-economic characteristics of Boston's different districts on the number of crime reports and incidents. To do so, we need to extract the relevant dataset. Each row should correspond to one of the 12 police districts of Boston (as listed in the districts table of the database).

Each row should contain the following columns:

- the CODE of the police district (1 letter and 1 or 2 digits);
- the full NAME of the police district;
- NB_INCIDENTS, i.e. the total number of incidents reported in the police district over the period covered by the data at hand (2015-2019);
- several socio-economic indicators:
  - MEDIAN_AGE;
  - TOTAL_POP;
  - PERC_OF_30_34;
  - PERC_MARRIED_COUPLE_FAMILY;
  - PER_CAPITA_INCOME;
  - PERC_OTHER_STATE_OR_ABROAD;
  - PERC_LESS_THAN_HIGH_SCHOOL;
  - PERC_COLLEGE_GRADUATES.

In [16]:
# SQL query to build the dataset
query = """
    SELECT
        districts.CODE,
        districts.NAME, 
        COUNT(incident_reports.INCIDENT_NUMBER) AS NB_INCIDENTS,
        indicators.MEDIAN_AGE,
        indicators.TOTAL_POP,
        indicators.PERC_OF_30_34,
        indicators.PERC_MARRIED_COUPLE_FAMILY,
        indicators.PER_CAPITA_INCOME,
        indicators.PERC_OTHER_STATE_OR_ABROAD,
        indicators.PERC_LESS_THAN_HIGH_SCHOOL,
        indicators.PERC_COLLEGE_GRADUATES
    FROM districts
    JOIN incident_reports ON incident_reports.DISTRICT = districts.CODE 
    JOIN indicators ON indicators.CODE = districts.CODE 
    GROUP BY districts.CODE
    ORDER BY NB_INCIDENTS DESC
"""

In [17]:
# Connect to the SQLite database downloaded above
db_path = 'db/boston_crimes.sqlite'
conn = sqlite3.connect(db_path)
c = conn.cursor()

In [20]:
crimes_df = pd.read_sql_query(query, conn)
crimes_df

| | CODE | NAME | NB_INCIDENTS | MEDIAN_AGE | TOTAL_POP | PERC_OF_30_34 | PERC_MARRIED_COUPLE_FAMILY | PER_CAPITA_INCOME | PERC_OTHER_STATE_OR_ABROAD | PERC_LESS_THAN_HIGH_SCHOOL | PERC_COLLEGE_GRADUATES |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B2 | Roxbury | 38877 | 32.5 | 54161 | 27.8 | 17.8 | 20978 | 2.9 | 23.0 | 18.9 |
| 1 | C11 | Dorchester | 32875 | 33.4 | 126909 | 28.2 | 26.6 | 29767 | 2.4 | 18.0 | 17.1 |
| 2 | D4 | South End | 31258 | 37.1 | 32571 | 33.9 | 28.3 | 83609 | 6.2 | 11.8 | 8.5 |
| 3 | B3 | Mattapan | 28331 | 36.7 | 26659 | 20.9 | 29.8 | 28356 | 2.3 | 14.5 | 22.9 |
| 4 | A1 | Downtown | 26260 | 33.5 | 18306 | 32.5 | 35.8 | 80057 | 14.8 | 15.4 | 6.9 |
| 5 | C6 | South Boston | 16617 | 31.9 | 36772 | 46.1 | 24.7 | 64745 | 2.4 | 7.9 | 8.4 |
| 6 | D14 | Brighton | 13788 | 30.8 | 55297 | 52.8 | 26.4 | 41261 | 8.6 | 6.7 | 10.5 |
| 7 | E13 | Jamaica Plain | 12802 | 34.8 | 40867 | 32.5 | 33.7 | 51655 | 5.5 | 8.0 | 12.1 |
| 8 | E18 | Hyde Park | 12551 | 39.4 | 38924 | 21.1 | 38.4 | 32744 | 1.9 | 13.8 | 21.3 |
| 9 | A7 | East Boston | 9691 | 30.6 | 47263 | 31.1 | 30.4 | 31473 | 3.5 | 27.2 | 11.5 |


In [21]:
# check of the shape
crimes_df.shape

(12, 11)

## Linear Regression

As mentioned above, we want to investigate the impact of the socio-economic characteristics of the different Boston police districts on the number of incidents reported in these areas.

We will use the number of incidents (NB_INCIDENTS) as the dependent variable. Our regressors will be the various socio-economic indicators extracted from the database.
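A minimal sketch of such a regression with the `statsmodels` formula API is given below. It is self-contained, so it rebuilds a small DataFrame from the ten rows displayed in the query output above rather than reusing `crimes_df`. The choice of two regressors (`PER_CAPITA_INCOME` and `PERC_LESS_THAN_HIGH_SCHOOL`) is an assumption made for illustration, not a model selected by the analysis; with so few districts, only a handful of regressors can be estimated at once.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Three columns copied from the ten rows displayed in the query output
sample_df = pd.DataFrame({
    "NB_INCIDENTS": [38877, 32875, 31258, 28331, 26260,
                     16617, 13788, 12802, 12551, 9691],
    "PER_CAPITA_INCOME": [20978, 29767, 83609, 28356, 80057,
                          64745, 41261, 51655, 32744, 31473],
    "PERC_LESS_THAN_HIGH_SCHOOL": [23.0, 18.0, 11.8, 14.5, 15.4,
                                   7.9, 6.7, 8.0, 13.8, 27.2],
})

# Ordinary least squares: incidents explained by two illustrative indicators
model = smf.ols(
    "NB_INCIDENTS ~ PER_CAPITA_INCOME + PERC_LESS_THAN_HIGH_SCHOOL",
    data=sample_df,
).fit()

# Estimated coefficients, their p-values, and the share of variance explained
print(model.params)
print(model.pvalues)
print(model.rsquared)
```

With a sample this small, the p-values deserve more attention than the point estimates: a coefficient that is not statistically significant should not be read as evidence of an effect.
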

# Analysis for the police department

## Data manipulation

## Data visualisation

## Short presentation