# Report

## Introduction and data

### Subject Matter:

Many crimes happen in Chicago every day, from minor thefts to murders. To reduce violence, the city wants to open a new crime prevention centre and has asked our team which crimes occur particularly frequently and where they happen. With this information, the **Crime Prevention Center** can be built in a well-situated location, and its specialised departments can be trained for the relevant criminal offences. This should make Chicago a safer city and ensure that measures are taken at an early stage to prevent crime. We therefore analyse where and when it makes the most sense to open the Crime Prevention Center so that it has the greatest effect.

### Motivation:
Parts of Chicago are among the most violent and dangerous neighborhoods in the United States. We would like to help the city prevent some crimes in the future and let Chicago live up to its potential!

Various studies show that it is possible to prevent crime in cities with the help of specific actions. With the new **Crime Prevention Center**, we want to take a new approach in Chicago to prevent crime from the very beginning.

* Crime Prevention and the Safer Cities Story
https://onlinelibrary.wiley.com/doi/10.1111/j.1468-2311.1993.tb00758.x

* Social Crime Prevention in South Africa's Major Cities 
http://csvr.org.za/docs/urbansafety/socialcrimeprevention.pdf




### General Question:

Which kind of crimes happen particularly frequently and where/when do they happen?

### Hypotheses:

There are neighborhoods in Chicago where most of the (dangerous) crimes/incidents occur.
There is a certain time of day when most crimes occur.

### Observations:

**Note:** As there were many unnecessary observations, we dropped these to keep our dataset as clean as possible.

* **Date** Date when the incident occurred.

* **Block**	The partially redacted address where the incident occurred, placing it on the same block as the actual address.

* **Primary Type** The primary description of the IUCR code.

* **Arrest** Indicates whether an arrest was made.	
	
* **District** Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.	

* **Year** Year the incident occurred.	

* **Latitude** The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

* **Longitude** The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

* **month** Month the incident occurred.

* **hour** Hour the incident occurred.

* **primary_group** We decided to combine the primary types into 3 groups: 

- Group_1: light to medium crimes
- Group_2: medium to serious crimes
- Group_3: homicide/murder

### How the data was collected:

**Note:** As the original dataset was too large, we have reduced it a little, so it only contains criminal cases from 2018 and 2019.

Crimes - 2001 to Present

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. More information about this dataset: https://catalog.data.gov/dataset/crimes-2001-to-present

### EDA

The following section contains the EDA. Most of the data preparation and cleaning of the raw data was done separately.
- Data transformation and handling missing values. 
- Unnecessary observations and variables were deleted.
- The variable "date" was converted into "month", "hour" and "days". 
- The original 33 "primary_type" categories were merged into 10 "primary_type" categories and renamed.
- Creation of smaller datasets.
- Preparation of the map-dataset for visualisation.
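
The date conversion listed above could look roughly like the following sketch. The preparation code itself is not part of this report, so the raw column name `date`, the timestamp format and the choice to store the weekday name in `day` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows standing in for the raw export; the column name "date" and the
# timestamp format are assumptions, not taken from the report.
raw = pd.DataFrame({"date": ["01/05/2018 02:30:00 PM", "07/14/2019 11:05:00 PM"]})

# Parse the text timestamps once, then derive the calendar parts from them.
parsed = pd.to_datetime(raw["date"], format="%m/%d/%Y %I:%M:%S %p")
raw["month"] = parsed.dt.month      # 1..12
raw["hour"] = parsed.dt.hour        # 0..23
raw["day"] = parsed.dt.day_name()   # weekday name (assumed meaning of "day")
```

Parsing with an explicit `format` avoids ambiguity between day-first and month-first dates.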



In [None]:
from pathlib import Path

PARENT_PATH = str(Path().resolve().parent) + "/"
PATH = "data/"
SUBPATH = "processed/"
FILE = "chicago_crimes-20230130-1108"
FORMAT = ".csv"

In [None]:
import altair as alt
from vega_datasets import data
alt.data_transformers.disable_max_rows()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
import pandas as pd

df = pd.read_csv(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

In [None]:
# drop columns that did not make it into our report
df.drop(['id', 'ward', 'day'], axis=1, inplace=True)

In [None]:
#Geopandas library to work with Chicago map
import geopandas as gpd

In [None]:
PARENT_PATH = str(Path().resolve().parent) + "/"
PATH = "data/"
SUBPATH = "external/"
FILE = "wards"
FORMAT = ".shp"

gdf = gpd.read_file(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

In [None]:
df.head()

In [None]:
df.info()

### Description of the crimes
There were 76362 registered crimes in 2018 and 2019. "theft" was the most common with 21637 cases committed.

In [None]:
df["primary_type"].describe()

### Distribution of crimes over the two years (2018 + 2019)
The crimes are distributed almost equally across the two years; there were slightly more crimes in 2018.

In [None]:
chart_3 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('year:N',
            axis=alt.Axis(title="YEAR",
                          titleY=15)),
    x=alt.X('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="start")),
).properties(
    title='Count of committed crime per year',
    width=400,
    height=200
)


chart_3.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

### Distribution of crimes per year in months
Most crimes occur in late spring and summer, from May to August.

In [None]:
chart_4 = alt.Chart(df).mark_line().encode(
    x=alt.X('month:N',
            axis=alt.Axis(title="MONTH",
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title="YEAR"))                  
).properties(
    title='Count of committed crime per month',
    width=600,
    height=400
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

chart_4.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
)

### Distribution of crimes per day in hours
- Most crimes occur during the day.
- Constant increase from 8 am to 7 pm.
- Peak at 12 pm.
- Constant decrease from 7 pm to 1 am.
- Lowest at 5 am.

In [None]:
chart_5 = alt.Chart(df).mark_line().encode(
    x=alt.X('hour:N',
            axis=alt.Axis(title="HOUR",
                          titleAnchor="start",
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title="YEAR"))
).properties(
    title='Count of committed crime per hour',
    width=600,
    height=400
).configure_axis(grid=False
).configure_view(strokeOpacity=0)


chart_5.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
)

In [None]:
#top 5 hours
df["hour"].value_counts().nlargest(5)

### Distribution of the crimes by type

- most: "theft" and "assault_and_battery"
- least: "sexual_crime" and "homicide"

In [None]:
df.primary_type.value_counts()

In [None]:
ch = alt.Chart(df).mark_bar().encode(
    # x shows the crime type, y the number of incidents
    x=alt.X("primary_type", sort="-y",
            axis=alt.Axis(title="TYPE",
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y('count(primary_type)', 
            axis=alt.Axis(title="COUNT", 
                          titleAnchor="end"),
            scale=alt.Scale(domain=[0, 24000])),
).properties(
    title='Count of committed crimes per type',
    width=1000,
    height=400
)

txt = ch.mark_text(
    baseline = 'middle',
    dy= - 15
).encode(
    text='count(primary_type)'
)

layer = alt.layer(ch + txt
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

layer

### Distribution of crimes per block/street

The 5 streets with the most reported crimes are:
- 1. State Street (1575)
- 2. Michigan Avenue (1520)
- 3. Halsted Street (1049)
- 4. Ashland Avenue (925)
- 5. Clark Street (845)

In [None]:
df["block"].value_counts().nlargest(10)

### Distribution of crime type per district in percent

This cross table shows each combination of crime type and district as a percentage of all crimes.

For example: 2.97% of all crimes are thefts in District 1.
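
As a small check of how such a percentage arises, the sketch below recomputes the example from two counts given elsewhere in this report (2271 thefts in District 1, 76362 incidents in total). It then uses a toy table, not the real dataset, to contrast `normalize=True` (share of all incidents) with `normalize="columns"` (share within each district).

```python
import pandas as pd

# Share of all incidents that are thefts in District 1, from the report's counts.
theft_district_1 = 2271
all_incidents = 76362
share_pct = round(theft_district_1 / all_incidents * 100, 2)

# Toy data (not the real dataset) to contrast the two normalisations.
toy = pd.DataFrame({
    "primary_type": ["theft", "theft", "burglary", "theft"],
    "district": [1, 1, 1, 2],
})
# Each cell divided by the grand total.
of_total = pd.crosstab(toy["primary_type"], toy["district"], normalize=True) * 100
# Each cell divided by its district (column) total.
within_district = pd.crosstab(toy["primary_type"], toy["district"], normalize="columns") * 100
```

With `normalize=True` all cells together sum to 100, whereas with `normalize="columns"` every district column sums to 100. The second form would answer "what share of the crimes in a district are thefts".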

In [None]:
#primary_type and district crosstab

cross_table = pd.crosstab(df["primary_type"], df["district"],
    margins=True,
    normalize=True,
    rownames=["Type"],
    colnames=["District"]
    )* 100


cross_table

Here we can see in which districts the most crime happens (in %).
The most violent districts are:
- 1. District 11 (7.1% of all crime)
- 2. District 6 (6.2%)
- 3. District 8 (6.0%)
- 4. District 1 (6.0%)
- 5. District 18 (5.9%)

In [None]:
#top 5 districts
df["district"].value_counts(normalize=True).nlargest(5) * 100

### Distribution of crime per district in total

Top 5 Districts:

In [None]:
df["district"].value_counts().nlargest(5)

In [None]:
district = alt.Chart(df).mark_bar().encode(
    x=alt.X("district:N",
    sort="-y",
    axis=alt.Axis(title="DISTRICT",  
                          titleAnchor="start", 
                          labelAngle=0, grid=False)),
    y=alt.Y("count(primary_type)",
    axis=alt.Axis(title = "COUNT", 
                        titleAnchor="end")),
    tooltip=['district', alt.Tooltip('count(primary_type)', title='count')]
).properties(
    title='Count of committed crimes per district',
    width=1000,
    height=400
)



district.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

### Distribution of crime groups in total

Since 10 single crime types were too many for some purposes (map, etc.), we decided to combine the types into groups.

- Group_1: light to medium crimes
- Group_2: medium to serious crimes
- Group_3: homicide/murder
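
The grouping could be implemented as a simple lookup, as in the sketch below. Only "homicide belongs to group 3" is stated in this report. The split of the other nine types between `group_1` and `group_2` is a hypothetical assignment for illustration and may differ from the one actually used.

```python
import pandas as pd

# Hypothetical assignment: only homicide -> group_3 is stated in the report;
# the group_1 / group_2 split below is an illustrative guess.
GROUP_OF_TYPE = {
    "theft": "group_1",
    "criminal_damage": "group_1",
    "deceptive_practice": "group_1",
    "other_offense": "group_1",
    "narcotics": "group_1",
    "burglary": "group_2",
    "assault_and_battery": "group_2",
    "robbery_and_weapons": "group_2",
    "sexual_crime": "group_2",
    "homicide": "group_3",
}

# Toy rows; on the real data the same .map() call would fill primary_group.
toy = pd.DataFrame({"primary_type": ["theft", "homicide", "burglary"]})
toy["primary_group"] = toy["primary_type"].map(GROUP_OF_TYPE)
```

A dictionary keeps the assignment in one place, and any type missing from it shows up as `NaN` after `.map()`, which makes gaps easy to spot.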


In [None]:
#Barchart with different groups of crime
# we built these groups sorted according to severity

ch = alt.Chart(df).mark_bar().encode(
    y=alt.Y('count(primary_group)', 
            axis=alt.Axis(title = "COUNT", 
                        titleAnchor="end")),
    x=alt.X("primary_group", sort="-y",
            axis=alt.Axis(title="GROUP",
                            labelAngle=0,
                          titleAnchor="start")),

).properties(
    title='Count of committed crimes per group',
    width=800,
    height=400
)

txt = ch.mark_text(
    baseline = 'middle',
    dy = -15
).encode(
    text='count(primary_group)'
)

layer = alt.layer(ch + txt
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

layer


### Arrest

Here we can see how many of the different crimes led to an arrest. 
- Only 62 out of 159 homicides led to an arrest.
- The arrest rate for narcotics is extremely high: 4167 out of 4170 incidents led to an arrest.
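
Expressed as rates, the two counts above are easier to compare. The sketch below uses only the numbers quoted in the bullets; the column names `incidents`, `arrests` and `arrest_rate_pct` are chosen here for illustration.

```python
import pandas as pd

# Counts quoted in the text above: incidents and incidents that led to an arrest.
counts = pd.DataFrame(
    {"incidents": [159, 4170], "arrests": [62, 4167]},
    index=["homicide", "narcotics"],
)
# Arrest rate in percent, rounded to one decimal place.
counts["arrest_rate_pct"] = (counts["arrests"] / counts["incidents"] * 100).round(1)
```

This gives roughly 39.0% for homicide and 99.9% for narcotics. On the full dataset, `pd.crosstab(df["primary_type"], df["arrest"], normalize="index")` should produce the same kind of rate for every crime type.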

In [None]:
#arrest crime crosstab in total

cross_table = pd.crosstab(df["primary_type"], df["arrest"],
    margins=True,
    normalize=False,
    rownames=["Crime"],
    colnames=["Arrest"]
    )


cross_table

## Visualizations

The first visualisation is a map of Chicago on which the individual crimes from 2018 and 2019 are displayed as small squares, coloured by primary_group. Because the squares are semi-transparent, it is easy to see in which districts the most crimes took place. It is also possible to filter by one of the three primary_group values. The map allows the viewer to quickly get an overview of the local frequency and severity of the crimes.

The second visualisation is a line chart covering the 24 hours of the day. It shows the distribution of cases within a day, separately for 2018 and 2019, and is interactive. It is important for us to know at what time the crimes were committed in order to adapt the opening hours of the Crime Prevention Center to the local conditions.

The third visualization contains two plots. The first one shows the number of crimes committed per district, in descending order. This is to show the viewer quickly and easily in which districts the most crimes were committed.
To get an even more detailed insight into the crimes committed in the individual districts, the second stacked bar plot was added. This plot shows the individual crimes by type in percentage frequency. This should help to evaluate the crimes in the districts not only by frequency, but also by severity.

In all visualisations, care was taken to ensure that the viewer can understand the message of the graphic as easily as possible.
Unnecessary distractions such as grids or frames were avoided. The axes were labelled in such a way that they support the viewer's eye. The interactivity of the charts encourages the viewer to engage with the visualisations and to discover while using the chart.



### Map (interactive)
This map shows the locations of the crimes that occurred in Chicago. The more intense the colour at a point on the map, the more crimes were committed there. Hover with your mouse to see the district, block and crime type. You can also filter the view by one of the three primary_groups.

In [None]:
#map of chicago
choro = alt.Chart(gdf).mark_geoshape(
    fill="white", stroke='grey'
).encode()

#selection
group_radio = alt.binding_radio(options=['group_1','group_2','group_3'], name='Select_Group: ')
group_select = alt.selection_single(
    fields=["primary_group"], bind=group_radio
)

group_color_condition = alt.condition(
    group_select,
    alt.Color("primary_group:N"),
    alt.value("lightgrey"),
)

#squares
p = alt.Chart(df).mark_square(opacity=0.3).encode(
        longitude='longitude', 
        latitude='latitude', 
        size=alt.value(10), 
        tooltip=["district", "block", "primary_type"]
).add_selection(group_select
).encode(color=group_color_condition
).properties(
    title="Locations of crimes in Chicago City",
    width=800,
    height=1000)
    
layer = alt.layer(choro + p
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

layer


### Line chart (interactive)
Here we can see the difference between 2018 and 2019. This graph shows at what time of day crimes were committed. Most crimes were committed at 12 pm and at 7 pm.


In [None]:

# select a point for which to provide details-on-demand
label = alt.selection_single(
    encodings=['x'], # limit selection to x-axis value
    on='mouseover',  # select on mouseover events
    nearest=True,    # select data point nearest the cursor
    empty='none'     # empty selection includes no data points
)

chart_5 = alt.Chart().mark_line().encode(
    x=alt.X('hour:N',
            axis=alt.Axis(title="HOUR",
                          titleAnchor="start",
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title=" ", orient='none', legendX=820, legendY=180))
)


alt.layer(
    chart_5,
    # vertical rule at the hovered hour
    alt.Chart().mark_rule(color='lightgrey').encode(
        x='hour:N'
    ).transform_filter(label),
    # points that carry the hover selection; only the selected one is visible
    chart_5.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),
    # white outline behind the label so it stays readable on top of the lines
    chart_5.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
        text='count(primary_type)'
    ).transform_filter(label),
    # the count label itself
    chart_5.mark_text(align='left', dx=5, dy=-5).encode(
        text='count(primary_type)'
    ).transform_filter(label),
    data=df
).properties(
    title="Distribution of committed crimes per hour",
    width=800,
    height=600
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

### Bar chart & stacked bar chart (interactive: drag to select single or multiple districts)
- In the first bar chart we can see again in which districts the most crimes were committed. The most violent districts are 11, 6, 8, 1 and 18; the least violent are 20, 17 and 24.
- In the stacked bar chart we can see how the different types of crime are distributed in the districts. For example, there were 2271 thefts in District 1.

In [None]:
order_crime = ["theft", "assault_and_battery","criminal_damage", "deceptive_practice", "burglary", "other_offense", "robbery_and_weapons", "narcotics", "homicide", "sexual_crime"]

brush = alt.selection(type='interval')

bar = alt.Chart(df).mark_bar().encode(
    x=alt.X("district:N",
    sort="-y",
    axis=alt.Axis(title="DISTRICT",  
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y("count(primary_type):Q",
    axis=alt.Axis(title="COUNT",  
                          titleAnchor="end")),
    tooltip=[alt.Tooltip('count(primary_type)', title='count')]
).add_selection(
    brush
).properties(
    title='Count of committed crime per districts',
    width=1000,
    height=400
)



bars = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(primary_type)', stack="normalize",
    axis=alt.Axis(format="%",title = "DISTRIBUTION", 
                          titleAnchor="start")),
    y=alt.Y('district:N',
    axis=alt.Axis(title="DISTRICT",  
                          titleY=25)),
    color=alt.Color('primary_type', sort=order_crime, 
    legend=alt.Legend(title="TYPE", orient='none', legendX=1100, legendY=480)),
    tooltip=["primary_type", alt.Tooltip('count(primary_type)', title='count')]
).transform_filter(
    brush
).properties(
    title='Distribution of crime types per district',
    width=1000,
    height=600)

alt.vconcat(bar & bars).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)


## Conclusion + recommended action


Our analysis shows that crime in Chicago is concentrated in certain districts and at certain times of day, and that the city council needs to act to prevent it.

We therefore suggest opening Crime Prevention Centers in the districts. We understand that it is too expensive to open one in every district. Based on our analysis and our graphs, we would open at least 5 centers, one in each of the most violent districts:

- 1. District 11 (5432 incidents)
- 2. District 6 (4712)
- 3. District 8 (4591)
- 4. District 1 (4560)
- 5. District 18 (4485)
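
To put these counts into perspective, the following sketch relates them to the 76362 incidents reported in the EDA section. It only does arithmetic on numbers already stated in this report; the variable names are chosen for illustration.

```python
# Incident counts per district as listed above; total taken from the EDA section.
top_districts = {11: 5432, 6: 4712, 8: 4591, 1: 4560, 18: 4485}
total_incidents = 76362

# Share of all incidents per district, in percent.
share_pct = {d: round(n / total_incidents * 100, 1) for d, n in top_districts.items()}
# Share of all incidents that fall into the five districts together.
combined_pct = round(sum(top_districts.values()) / total_incidents * 100, 1)
```

Together the five districts account for about 31.1% of all incidents in the dataset, so five centers would be located where close to a third of the reported crime occurs.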

We also understand that the Prevention Centers can't be open 24 hours a day, so we would adapt their opening hours to the times when most crimes happen statistically. We would suggest opening at least from 11 am to 8 pm because these are the most violent hours, especially 12 pm and 7 pm. The 5 most violent hours are:

- 1. 12pm   (4663 incidents)
- 2. 7pm    (4413)
- 3. 6pm    (4383)
- 4. 3pm    (4226)
- 5. 5pm    (4225)
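
How much of the daily crime a given opening window would cover can be estimated with a small helper like the one sketched below. The function name `share_within_hours` and the toy hours are hypothetical; on the real data the `hour` column of `df` would be passed in instead.

```python
import pandas as pd

def share_within_hours(hours: pd.Series, start: int, end: int) -> float:
    """Fraction of incidents whose hour h satisfies start <= h < end."""
    # Series.between is inclusive on both ends, hence end - 1.
    return float(hours.between(start, end - 1).mean())

# Toy hours (not the real dataset): 7 of the 10 values lie in 11..19.
toy_hours = pd.Series([0, 5, 11, 12, 12, 15, 18, 19, 19, 22])
coverage = share_within_hours(toy_hours, start=11, end=20)
```

An opening time of 11 am to 8 pm corresponds to the hours 11 through 19, so `end=20` is used here. Comparing the result for several candidate windows would show how much coverage each extra opening hour adds.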

Looking at the months of the year, we would still recommend keeping the Prevention Center open all year round, because there is no month with a significant decrease in crime.

