# Report

## Introduction and data


### Subject Matter:

In Chicago, many crimes happen every day, from minor thefts to murders. To reduce violence, the city wants to open a new crime prevention center and has asked our team which crimes occur particularly frequently and where they happen. With this information, the **Crime Prevention Center** can be built in a well-situated location, and its specialized departments can be trained for the relevant offenses. This should make Chicago a safer city and ensure that measures are taken at an early stage to prevent crime. We therefore analyze where and when it makes the most sense to open the Crime Prevention Center.

### Motivation:
Parts of Chicago are among the most violent and dangerous neighborhoods in the United States. We would like to help the city prevent crimes in the future and let Chicago live up to its potential!

Various studies show that it is possible to prevent crime in cities with the help of specific actions. With the new **Crime Prevention Center**, we want to take a new approach in Chicago to prevent crime from the very beginning.

* Crime Prevention and the Safer Cities Story
https://onlinelibrary.wiley.com/doi/10.1111/j.1468-2311.1993.tb00758.x

* Social Crime Prevention in South Africa's Major Cities 
http://csvr.org.za/docs/urbansafety/socialcrimeprevention.pdf




### General Question:

Which kind of crimes happen particularly frequently and where do they happen?

### Hypotheses:

- There are neighborhoods in Chicago where most of the (dangerous) crimes/incidents occur.
- There is a certain time of day when most crimes occur.


In [None]:
from pathlib import Path

PARENT_PATH = str(Path().resolve().parent) + "/"
PATH = "data/"
SUBPATH = "processed/"
FILE = "chicago_crimes-20230130-1108"
FORMAT = ".csv"

In [None]:
import altair as alt
from vega_datasets import data
alt.data_transformers.disable_max_rows()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
import pandas as pd

df = pd.read_csv(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

In [None]:
df.head()

In [None]:
df.info()

### Description of the crimes
In 2018 and 2019, 76362 crimes were registered. "theft" was the most common type, with 21637 cases.

In [None]:
df["primary_type"].describe()

### Distribution of crimes over the two years (2018 + 2019)
The crimes are distributed almost equally across the two years; there were slightly more crimes in 2018.

In [None]:
chart_3 = alt.Chart(df).mark_bar().encode(
    y=alt.Y('year:N',
            axis=alt.Axis(title="YEAR",
                          titleY=15)),
    x=alt.X('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="start")),
).properties(
    title='Count of committed crime per year',
    width=400,
    height=200
)


chart_3.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

### Distribution of crimes per year in months
Most crimes occur in late spring and summer, from May to August.

In [None]:
chart_4 = alt.Chart(df).mark_line().encode(
    x=alt.X('month:N',
            axis=alt.Axis(title="MONTH",
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title="YEAR"))                  
).properties(
    title='Count of committed crime per month',
    width=600,
    height=400
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

chart_4.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
)
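
The monthly counts behind the line chart can also be tabulated directly, which makes the seasonal pattern easier to quote in numbers. The following is a minimal sketch on hypothetical toy rows (not the Chicago data); it only assumes that the data frame has integer `year` and `month` columns, as used in the chart above.

```python
import pandas as pd

# Hypothetical toy rows, one row per registered crime.
toy = pd.DataFrame({
    "year":  [2018] * 6 + [2019] * 6,
    "month": [1, 5, 5, 7, 7, 12, 1, 5, 7, 7, 8, 12],
})

# Count the rows per month and year, with one column per year.
monthly = toy.groupby(["month", "year"]).size().unstack("year", fill_value=0)

# Share of each month within its year, in percent.
monthly_share = monthly / monthly.sum() * 100

print(monthly)
print(monthly_share.round(1))
```

With the real data, `toy` would be replaced by `df`; months without any crime in a given year then show up as 0 instead of being dropped.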

### Distribution of crimes per day in hours
- Most crimes occur during the day.
- Steady increase from 8 am to 7 pm.
- Peak at 12 pm.
- Steady decrease from 7 pm to 1 am.
- Lowest at 5 am.

In [None]:
chart_5 = alt.Chart(df).mark_line().encode(
    x=alt.X('hour:N',
            axis=alt.Axis(title="HOUR",
                          titleAnchor="start",
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title="YEAR"))
).properties(
    title='Count of committed crime per hour',
    width=600,
    height=400
).configure_axis(grid=False
).configure_view(strokeOpacity=0)


chart_5.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
)

In [None]:
df["hour"].value_counts()

### Distribution of the crimes by type

- most: "theft" and "assault_and_battery"
- least: "sexual_crime" and "homicide"

In [None]:
df.primary_type.value_counts()

In [None]:
ch = alt.Chart(df).mark_bar().encode(
    x=alt.X("primary_type", sort="-y",
            axis=alt.Axis(title="TYPE",
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y('count(primary_type)', 
            axis=alt.Axis(title = "COUNT", 
                        titleAnchor="end"),
                        scale=alt.Scale(domain=[0, 24000])),
).properties(
    title='Count of committed crimes per type',
    width=1000,
    height=400
)

txt = ch.mark_text(
    baseline = 'middle',
    dy= - 15
).encode(
    text='count(primary_type)'
)

layer = alt.layer(ch + txt
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

layer

### Distribution of crimes per block/street

The 5 streets with the most reported crimes are:
1. State Street (1575)
2. Michigan Avenue (1520)
3. Halsted Street (1049)
4. Ashland Avenue (925)
5. Clark Street (845)

In [None]:
df["block"].value_counts().nlargest(10)

In [None]:
display(df[(df['district']==1) & (df['block'] == "state_st") & (df['primary_group'] == "group_1")]) # 951 cases for state_st in district 1

In [None]:
display(df[(df['block'] == "ashland_ave")]) # kedzie_ave 92, pulaski_rd 213, western_ave 27, madison_st 274

### Distribution of crime type per district in percent

This crosstab shows the share of all crimes (in %) for each combination of crime type and district.

For example, thefts in District 1 account for 2.97% of all crimes.

In [None]:
#primary_type and district crosstab

cross_table = pd.crosstab(df["primary_type"], df["district"],
    margins=True,
    normalize=True,
    rownames=["Type"],
    colnames=["District"]
    )* 100


cross_table
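
How the percentages are to be read depends on the `normalize` argument of `pd.crosstab`. The sketch below uses a small hypothetical data set (not the Chicago data) to contrast `normalize=True`, as used above, where every cell is a share of all rows, with `normalize="columns"`, where every cell is a share of its district.

```python
import pandas as pd

# Hypothetical toy rows, one row per registered crime.
toy = pd.DataFrame({
    "primary_type": ["theft", "theft", "theft", "robbery",
                     "robbery", "narcotics", "theft", "narcotics"],
    "district": [1, 1, 2, 1, 2, 2, 2, 2],
})

# normalize=True: each cell is a share of ALL crimes (cells sum to 100).
share_of_all = pd.crosstab(
    toy["primary_type"], toy["district"], normalize=True
) * 100

# normalize="columns": each cell is a share of the crimes in its district
# (every column sums to 100).
share_within_district = pd.crosstab(
    toy["primary_type"], toy["district"], normalize="columns"
) * 100

print(share_of_all.round(1))
print(share_within_district.round(1))
```

The first table answers "how much of all crime is theft in District 1?", the second "how much of the crime in District 1 is theft?". Which one is appropriate depends on the question being asked.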

Here we can see in which districts most crimes happen (in %).
The districts with the highest share of crime are:
1. District 11 (7.1% of all crimes)
2. District 6 (6.2%)
3. District 8 (6.0%)
4. District 1 (6.0%)
5. District 18 (5.9%)

In [None]:
df["district"].value_counts(normalize=True) * 100

### Distribution of crime per district in total

Top 5 Districts:

In [None]:
df["district"].value_counts().nlargest(5)

In [None]:
district = alt.Chart(df).mark_bar().encode(
    x=alt.X("district:N",
    sort="-y",
    axis=alt.Axis(title="DISTRICT",  
                          titleAnchor="start", 
                          labelAngle=0, grid=False)),
    y=alt.Y("count(primary_type)",
    axis=alt.Axis(title = "COUNT", 
                        titleAnchor="end")),
    tooltip=['district', alt.Tooltip('count(primary_type)', title='count')]
).properties(
    title='Count of committed crimes per district',
    width=1000,
    height=400
).configure_axis(grid=False
).configure_view(strokeOpacity=0)



district.configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

### Distribution of crime groups in total

Since 10 single crime types were too many for some purposes (map, etc.), we decided to combine the types into groups.

- Group_1: light to medium crimes
- Group_2: medium to serious crimes
- Group_3: homicide/murder
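
A grouping like this can be derived from `primary_type` with a plain dictionary lookup. The sketch below is illustrative only: the text fixes just the assignment of homicide to `group_3`, so the split of the other nine types between `group_1` and `group_2` is an assumption and may differ from the one used to build the `primary_group` column.

```python
import pandas as pd

# ASSUMED mapping: only "homicide" -> "group_3" is stated in the report.
# The assignment of the other types is hypothetical.
GROUP_BY_TYPE = {
    "theft": "group_1",
    "criminal_damage": "group_1",
    "deceptive_practice": "group_1",
    "other_offense": "group_1",
    "narcotics": "group_1",
    "burglary": "group_2",
    "assault_and_battery": "group_2",
    "robbery_and_weapons": "group_2",
    "sexual_crime": "group_2",
    "homicide": "group_3",
}

# Hypothetical toy rows standing in for the real data frame.
toy = pd.DataFrame({"primary_type": ["theft", "homicide", "burglary", "theft"]})
toy["primary_group"] = toy["primary_type"].map(GROUP_BY_TYPE)

# Types missing from the mapping would become NaN, so check for that.
unmapped = toy.loc[toy["primary_group"].isna(), "primary_type"].unique()
if len(unmapped) > 0:
    raise ValueError(f"No group defined for: {list(unmapped)}")

print(toy["primary_group"].value_counts())
```

Keeping the mapping in one dictionary makes the grouping easy to review and to change if the severity classification is revised.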


In [None]:
#Barchart with different groups of crime
# we built these groups sorted by severity

ch = alt.Chart(df).mark_bar().encode(
    y=alt.Y('count(primary_group)', 
            axis=alt.Axis(title = "COUNT", 
                        titleAnchor="end")),
    x=alt.X("primary_group", sort="-y",
            axis=alt.Axis(title="GROUP",
                            labelAngle=0,
                          titleAnchor="start")),

).properties(
    title='Count of committed crimes per group',
    width=800,
    height=400
)

txt = ch.mark_text(
    baseline = 'middle',
    dy = -15
).encode(
    text='count(primary_type)'
)

layer = alt.layer(ch + txt
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

layer


### Arrest

Here we can see how many crimes of each type led to an arrest. Only 62 out of 159 homicides led to an arrest.

In [None]:
#arrest crime crosstab in total

cross_table = pd.crosstab(df["primary_type"], df["arrest"],
    margins=True,
    normalize=False,
    rownames=["Crime"],
    colnames=["Arrest"]
    )


cross_table
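
Absolute counts are hard to compare between frequent and rare crime types, so an arrest rate per type can be a useful complement. The sketch below assumes that `arrest` is a boolean (or 0/1) column. The homicide rows reproduce the 62 of 159 arrests mentioned above, while the theft rows are hypothetical filler.

```python
import pandas as pd

# Homicide figures from the text (62 arrests out of 159 cases);
# the theft rows are hypothetical and only there for contrast.
toy = pd.DataFrame({
    "primary_type": ["homicide"] * 159 + ["theft"] * 10,
    "arrest": [True] * 62 + [False] * 97 + [True] * 1 + [False] * 9,
})

# The mean of a boolean column is the share of True values.
arrest_rate = (
    toy.groupby("primary_type")["arrest"].mean().mul(100).sort_values()
)

print(arrest_rate.round(1))
```

For the homicide figures quoted above, this corresponds to an arrest rate of roughly 39%.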

## Visualizations

In [None]:
#Geopandas library to work with Chicago map
import geopandas as gpd

In [None]:
PARENT_PATH = str(Path().resolve().parent) + "/"
PATH = "data/"
SUBPATH = "external/"
FILE = "wards"
FORMAT = ".shp"

gdf = gpd.read_file(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

This map shows the locations of the crimes committed in Chicago. The more intense the color at a point, the more crimes were committed there. Hover over a point with the mouse to see its district, block and crime type.

In [None]:
#Map of Chicago with Crimes as Dots on the Map


choro = alt.Chart(gdf).mark_geoshape(
    fill="white", stroke='grey'
).encode()

input_radio = alt.binding_radio(options=['group_1','group_2','group_3'], name='Select_Group: ')
selection = alt.selection_single(fields=['primary_group'], bind=input_radio)

p = alt.Chart(df).mark_square(opacity=0.3).encode(
    longitude='longitude',
    latitude='latitude',
    size=alt.value(10),
    color="primary_group:N",
    tooltip=["district", "block", "primary_type"]
).add_selection(
    selection
).transform_filter(
    selection
).properties(
    title="Location of crimes in Chicago City",
    width=1000,
    height=1000
)


layer = alt.layer(choro + p
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False)

layer


In [None]:

# select a point for which to provide details-on-demand
label = alt.selection_single(
    encodings=['x'], # limit selection to x-axis value
    on='mouseover',  # select on mouseover events
    nearest=True,    # select data point nearest the cursor
    empty='none'     # empty selection includes no data points
)

chart_5 = alt.Chart().mark_line().encode(
    x=alt.X('hour:N',
            axis=alt.Axis(title="HOUR",
                          titleAnchor="start",
                          labelAngle=0)),
    y=alt.Y('count(primary_type)',
            axis=alt.Axis(title = "COUNT", 
                          titleAnchor="end")),
    color=alt.Color("year:N", legend=alt.Legend(title="YEAR"))
)


alt.layer(
    chart_5,
    alt.Chart().mark_rule(color='lightgrey').encode(
        x='hour:N'
    ).transform_filter(label),

chart_5.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),

chart_5.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
        text='count(primary_type)'
    ).transform_filter(label),

chart_5.mark_text(align='left', dx=5, dy=-5).encode(
        text='count(primary_type)'
    ).transform_filter(label),
    data=df
).properties(
    title="Distribution of crimes per day in hours",
    width=800,
    height=600
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

Here we can see the difference between 2018 and 2019. The chart shows at what hour of the day the crimes were committed. Most crimes were committed at 12 pm and at 7 pm.

In [None]:
district_5 = alt.Chart(df).mark_bar().encode(
    x=alt.X("district:N",
    sort="-y",
    axis=alt.Axis(title="DISTRICT",  
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y("count(primary_type):Q",
    axis=alt.Axis(title="COUNT",  
                          titleAnchor="end")),
    color=alt.condition(
        alt.FieldOneOfPredicate('district', [11, 6, 8, 1, 18]),  # True if the district is one of the five with the most crimes,
        alt.value('orange'),     # which sets the bar orange.
        alt.value('steelblue')   # Otherwise the bar is steelblue.
    ),
    tooltip=["count(primary_type)"]
).properties(
    title='Count of committed crime in the districts',
    width=1000,
    height=400
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

district_5

In [None]:
df.primary_type.value_counts()

Here we can see again in which districts the most crimes were committed. The districts with the most crimes are 11, 6, 8, 1 and 18; those with the fewest are 20, 17 and 24.

In [None]:
#primary_type per district

order_crime = ["theft", "assault_and_battery","criminal_damage", "deceptive_practice", "burglary", "other_offense", "robbery_and_weapons", "narcotics", "homicide", "sexual_crime"]

alt.Chart(df).mark_bar().encode(
    x=alt.X('count(primary_type)', stack="normalize",
    axis=alt.Axis(format="%",title = "PERCENT", 
                          titleAnchor="start")),
    y=alt.Y('district:N',
    axis=alt.Axis(title = "DISTRICT", 
                          titleY=25)),
    color=alt.Color('primary_type', sort=order_crime),
    tooltip=["primary_type", alt.Tooltip('count(primary_type)', title='count')]
).properties(
    title='Distribution of crime types per district',
    width=1000,
    height=800
).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)

In [None]:
brush = alt.selection(type='interval')

bar = alt.Chart(df).mark_bar().encode(
    x=alt.X("district:N",
    sort="-y",
    axis=alt.Axis(title="DISTRICT",  
                          titleAnchor="start", 
                          labelAngle=0)),
    y=alt.Y("count(primary_type):Q",
    axis=alt.Axis(title="COUNT",  
                          titleAnchor="end")),
    tooltip=[alt.Tooltip('count(primary_type)', title='count')]
).add_selection(
    brush
).properties(
    title='Count of committed crime in the districts',
    width=1000,
    height=400
)



bars = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(primary_type)', stack="normalize",
    axis=alt.Axis(format="%",title = "PERCENT", 
                          titleAnchor="start")),
    y=alt.Y('district:N',
    axis=alt.Axis(title="DISTRICT",  
                          titleY=25)),
    color=alt.Color('primary_type', sort=order_crime, 
    legend=alt.Legend(orient='none', legendX=1100, legendY=480)),
    tooltip=["primary_type", alt.Tooltip('count(primary_type)', title='count')]
).transform_filter(
    brush
).properties(
    title='Distribution of crime types per district',
    width=1000,
    height=600)

alt.vconcat(bar & bars).configure_title(
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).configure_axis(grid=False
).configure_view(strokeOpacity=0)


This chart shows how the different crime types are distributed across the districts. For example, there were 2271 thefts in District 1.

## Conclusion + recommended action


Our analysis shows that crime is a serious problem in Chicago and that the city council needs to act, ideally through prevention.

We therefore still recommend opening Crime Prevention Centers in the districts. We understand that it is too expensive to open one in every district. Based on our analysis and charts, we would open at least five centers, one in each of the districts with the most incidents:
1. District 11 (5432 incidents)
2. District 6 (4712)
3. District 8 (4591)
4. District 1 (4560)
5. District 18 (4485)
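
As a quick consistency check, the incident counts above can be converted into shares of the 76362 registered crimes and compared with the percentages reported in the district analysis. This is a small sketch that uses only figures quoted in this report.

```python
import pandas as pd

TOTAL_CRIMES = 76362  # registered crimes in 2018 and 2019, as reported above

# Incident counts of the five districts listed above.
top_districts = pd.Series(
    {11: 5432, 6: 4712, 8: 4591, 1: 4560, 18: 4485},
    name="incidents",
)

# Share of all registered crimes, in percent.
share = top_districts / TOTAL_CRIMES * 100

print(share.round(1))
print(f"Top five districts combined: {share.sum():.1f}% of all incidents")
```

The rounded shares agree with the 7.1%, 6.2%, 6.0%, 6.0% and 5.9% reported earlier, and together the five districts account for roughly 31% of all incidents.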

We also understand that the Prevention Centers cannot be open 24 hours a day, so we would align their opening hours with the times when most crimes occur. We suggest opening at least from 11 am to 8 pm, because these are the hours with the most incidents, with peaks at 12 pm and 7 pm. The five hours with the most incidents are:
1. 12pm   (4663 incidents)
2. 7pm    (4413)
3. 6pm    (4383)
4. 3pm    (4226)
5. 5pm    (4225)
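
One way to turn hourly counts into an opening-hours proposal is to search for the contiguous block of hours that covers the most incidents. The sketch below is one possible approach, not the method used in this report. `busiest_window` is a hypothetical helper, the counts are toy values, and the hour is assumed to be stored as an integer from 0 to 23.

```python
import pandas as pd


def busiest_window(hourly_counts: pd.Series, length: int) -> tuple:
    """Return (start_hour, total) for the `length`-hour window with the most incidents.

    The window may wrap past midnight (e.g. hours 22, 23, 0, 1).
    """
    # Make sure all 24 hours are present, even those without incidents.
    counts = hourly_counts.reindex(range(24), fill_value=0)
    totals = {
        start: int(sum(counts.loc[(start + offset) % 24] for offset in range(length)))
        for start in range(24)
    }
    best_start = max(totals, key=totals.get)
    return best_start, totals[best_start]


# Hypothetical hourly counts, shaped like the result of df["hour"].value_counts().
toy_counts = pd.Series({hour: 100 for hour in range(24)})
for hour in range(11, 20):  # hours 11 to 19, i.e. 11 am up to 8 pm
    toy_counts.loc[hour] = 400

start_hour, incidents = busiest_window(toy_counts, length=9)
print(f"Busiest 9-hour window starts at {start_hour}:00 with {incidents} incidents")
```

With the real data, `toy_counts` would be replaced by `df["hour"].value_counts()`, and `length` by the number of hours the center can be staffed per day.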

Looking at the months of the year, we still recommend keeping the Prevention Centers open all year round, because no month shows a substantial drop in crime.

