
In [1]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = ':https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F33945%2F85901%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240511%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240511T042441Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D9a0930688838e6cc2191421e3a132ddffacce554d8100655e4d764e4d86cf5f84efe5121514b2c116dc4211947bd6f3e590b33d4576edffd3d0b3706d50f37d7aa657c0b12c9def07cf280aa41fa4f4b92e2eb238f12fdc5f588d7d4d411d64f97af2933c20be015dc10843bf982adfe23031492203481b889660afd66748a40aca0fa1e7c1669ea20f85ce845be2e101f88114c582bd0b7f4101557b905b182caf6eb64930738a864b413a2e190c56a3ef9afbbca105d2ed2dc3027b859cebbad309433c6a8566c9a644be88a1b15212be34d9a9e66e10b3c6967690729ce93790ecafe1b5d8d5e36f9c7c9d0ad56770da8f621fc03fe922f1613af85d95ec1'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
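            tfile.flush()  # make sure every downloaded byte is on disk before extracting
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Bind the archive to a distinct name so the tarfile module is not shadowed
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)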
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading , 221747 bytes compressed
Downloaded and uncompressed: 
Data source import complete.


# ***1. Brief recap***

* *Data:*

The dataset contains information about arrests, specifically focusing on individuals who were 18 years or older at the time of the crime. The data includes arrests for felony offenses, misdemeanors defined in the penal law, misdemeanors outside the penal law that would be considered felonies with a previous conviction, and loitering for prostitution.



* *Goals:*

It's crucial for law enforcement agencies to adapt their strategies based on the spatial and temporal dynamics of crime, as this can help in the allocation of resources to high-risk areas and times. Understanding the geographical distribution of different types of crimes can also assist in implementing targeted interventions in specific regions.

By considering the temporal variations in crime occurrence, such as daily, weekly, and yearly patterns, law enforcement agencies can better plan their surveillance and patrol activities to manage peaks in criminal activities effectively.


* *Key Points:*

The dataset contains 3055 observations and 13 variables. The variables include "County" (county where the arrest was recorded), "Year" (year of the arrest), and "Total" (total number of adult felony and misdemeanor arrests). The remaining ten columns break the arrests down into felonies (Total, Drug, Violent, DWI, Other) and misdemeanors (Total, Drug, DWI, Property, Other).



* *Tasks:*

Common Types of Crimes: Identifying the most prevalent types of crimes in the dataset

Geographical Distribution: Determining how different types of crimes are spatially concentrated across various regions

Temporal Variations: Investigating how arrest counts change from year to year (the dataset records one row per county per year, so daily or weekly patterns cannot be derived from it)
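As a rough sketch of the first task, the category columns can be summed and sorted to see which arrest type is most common. The rows below are the five Albany County rows that `df.head()` prints in this notebook, so the ranking only illustrates the technique and says nothing about the full dataset. The helper name `rank_categories` is hypothetical.

```python
import pandas as pd

# Albany County, 1970-1974: values copied from the df.head() output of this notebook.
sample = pd.DataFrame({
    "County": ["Albany"] * 5,
    "Year": [1970, 1971, 1972, 1973, 1974],
    "Drug Felony": [97, 131, 211, 244, 281],
    "Violent Felony": [191, 231, 256, 274, 308],
    "DWI Felony": [5, 6, 8, 28, 17],
    "Other Felony": [395, 461, 579, 588, 723],
    "Drug Misd": [207, 204, 285, 369, 437],
    "DWI Misd": [48, 111, 297, 497, 619],
    "Property Misd": [95, 272, 541, 668, 885],
    "Other Misd": [188, 417, 858, 905, 985],
})

# Only the leaf categories: the "... Total" columns are sums of these
# and would be double counted if included.
CATEGORY_COLUMNS = [
    "Drug Felony", "Violent Felony", "DWI Felony", "Other Felony",
    "Drug Misd", "DWI Misd", "Property Misd", "Other Misd",
]


def rank_categories(frame):
    """Return the summed arrests per category, largest first (hypothetical helper)."""
    return frame[CATEGORY_COLUMNS].sum().sort_values(ascending=False)


ranking = rank_categories(sample)
print(ranking)
```

Calling `rank_categories(df)` on the full frame would apply the same ranking across all counties and years.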

In [4]:
!pip install "altair[all]"



In [6]:
import numpy as np
import pandas as pd
import altair as alt

df=pd.read_csv("../input/adult-arrests-by-county-beginning-1970.csv")
df.head()

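|   | County | Year | Total | Felony Total | Drug Felony | Violent Felony | DWI Felony | Other Felony | Misdemeanor Total | Drug Misd | DWI Misd | Property Misd | Other Misd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albany | 1970 | 1226 | 688 | 97 | 191 | 5 | 395 | 538 | 207 | 48 | 95 | 188 |
| 1 | Albany | 1971 | 1833 | 829 | 131 | 231 | 6 | 461 | 1004 | 204 | 111 | 272 | 417 |
| 2 | Albany | 1972 | 3035 | 1054 | 211 | 256 | 8 | 579 | 1981 | 285 | 297 | 541 | 858 |
| 3 | Albany | 1973 | 3573 | 1134 | 244 | 274 | 28 | 588 | 2439 | 369 | 497 | 668 | 905 |
| 4 | Albany | 1974 | 4255 | 1329 | 281 | 308 | 17 | 723 | 2926 | 437 | 619 | 885 | 985 |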
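The rows printed above suggest that the "Total" columns are plain sums of the more detailed columns. A minimal sketch to check this, using only the five rows shown (the full frame `df` could be substituted for `head`, but whether every row satisfies the identities is an assumption until it is run):

```python
import pandas as pd

columns = [
    "County", "Year", "Total", "Felony Total", "Drug Felony", "Violent Felony",
    "DWI Felony", "Other Felony", "Misdemeanor Total", "Drug Misd", "DWI Misd",
    "Property Misd", "Other Misd",
]

# First five rows exactly as displayed by df.head() (Albany County, 1970-1974).
head = pd.DataFrame(
    [
        ["Albany", 1970, 1226, 688, 97, 191, 5, 395, 538, 207, 48, 95, 188],
        ["Albany", 1971, 1833, 829, 131, 231, 6, 461, 1004, 204, 111, 272, 417],
        ["Albany", 1972, 3035, 1054, 211, 256, 8, 579, 1981, 285, 297, 541, 858],
        ["Albany", 1973, 3573, 1134, 244, 274, 28, 588, 2439, 369, 497, 668, 905],
        ["Albany", 1974, 4255, 1329, 281, 308, 17, 723, 2926, 437, 619, 885, 985],
    ],
    columns=columns,
)

felony_parts = ["Drug Felony", "Violent Felony", "DWI Felony", "Other Felony"]
misd_parts = ["Drug Misd", "DWI Misd", "Property Misd", "Other Misd"]

# One boolean column per identity, one row per observation.
checks = pd.DataFrame({
    "felony_ok": head[felony_parts].sum(axis=1) == head["Felony Total"],
    "misd_ok": head[misd_parts].sum(axis=1) == head["Misdemeanor Total"],
    "total_ok": head["Felony Total"] + head["Misdemeanor Total"] == head["Total"],
})
print(checks.all())
```

Because "Felony Total" is a component of "Total", any plot of one against the other will show a positive relationship by construction.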


In [8]:
import csv
import json
import re

from collections import Counter, OrderedDict
from IPython.display import HTML

**Checking whether the dataset contains missing values**

In [10]:
df.isna().sum()

County               0
Year                 0
Total                0
Felony Total         0
Drug Felony          0
Violent Felony       0
DWI Felony           0
Other Felony         0
Misdemeanor Total    0
Drug Misd            0
DWI Misd             0
Property Misd        0
Other Misd           0
dtype: int64

*Geographical Distribution:*

In [23]:
alt.Chart(df).mark_bar().encode(x="County", y="Total")

In [64]:
# Implementing selection

In [63]:
# Highlight the county under the pointer. The selection field must be a
# column that exists in df, so use 'County' (there is no 'Region' column).
selection = alt.selection(type='multi', fields=['County'], on='mouseover', nearest=True)

alt.Chart(df).mark_circle().encode(
    x="Felony Total",
    y="Total",
    color=alt.Color('County', scale=alt.Scale(scheme='spectral')),
    size="Felony Total",
    tooltip=["County", "Total"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(selection)

In [77]:
# Save the interactive scatter plot as a standalone HTML file
chart = alt.Chart(df).mark_circle().encode(
    x = "Felony Total",
    y    = "Total",
    color=alt.Color('County', scale=alt.Scale(scheme='spectral')),
    size="Felony Total",
    tooltip=["County", "Total"],
).interactive()

chart.save('webchart.html', embed_options={'renderer':'svg'})

In [72]:
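# Dropdown that chooses which felony category drives the point size
dropdown = alt.binding_select(options=["Drug Felony", "Violent Felony", "Other Felony"], name="Select a size variable:")

# The initial value must be one of the dropdown options
selection = alt.selection(type="single", fields=['column'], bind=dropdown, init={'column': 'Drug Felony'})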


alt.Chart(df).transform_fold(
    ["Drug Felony","Violent Felony", "Other Felony"],
    as_=['column', 'value']
).transform_filter(
    selection
).mark_circle().encode(
    x = "Felony Total",
    y    = "Total",
    color=alt.Color('County', scale=alt.Scale(scheme='spectral')),
    size="value:Q",
    tooltip=["County", "Total"],
).add_selection(selection)
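To make the fold-then-filter step above more concrete, here is a rough pandas analogue: `DataFrame.melt` reshapes the three felony columns into a `column`/`value` pair, and a boolean mask plays the role of the dropdown filter. This is an approximation for illustration only; Altair performs the fold in the browser, and the two Albany rows are taken from the `df.head()` output.

```python
import pandas as pd

# Two rows from the df.head() output, restricted to the columns the chart uses.
wide = pd.DataFrame({
    "County": ["Albany", "Albany"],
    "Year": [1970, 1971],
    "Felony Total": [688, 829],
    "Total": [1226, 1833],
    "Drug Felony": [97, 131],
    "Violent Felony": [191, 231],
    "Other Felony": [395, 461],
})

fold_columns = ["Drug Felony", "Violent Felony", "Other Felony"]

# Wide -> long: one output row per (input row, folded column).
long = wide.melt(
    id_vars=["County", "Year", "Felony Total", "Total"],
    value_vars=fold_columns,
    var_name="column",
    value_name="value",
)

# Equivalent of picking "Drug Felony" in the dropdown.
selected = long[long["column"] == "Drug Felony"]
print(selected)
```

Each input row becomes three long-form rows, which is why the chart needs the filter: without it, every county-year would be drawn three times at the same x/y position.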

*Felony arrests correlate with total arrests*

In [25]:
alt.Chart(df).mark_circle().encode(x="Felony Total", y="Total")
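The visual impression from the scatter plot can be backed by a number. The sketch below computes the Pearson correlation with `Series.corr`, using only the five Albany rows from the `df.head()` output, so the value is illustrative rather than a result for the whole dataset. Note also that "Felony Total" is part of "Total", so some positive correlation is expected by construction.

```python
import pandas as pd

# Albany County, 1970-1974, from the df.head() output.
head = pd.DataFrame({
    "Year": [1970, 1971, 1972, 1973, 1974],
    "Felony Total": [688, 829, 1054, 1134, 1329],
    "Total": [1226, 1833, 3035, 3573, 4255],
})

# Series.corr uses the Pearson coefficient by default.
r = head["Felony Total"].corr(head["Total"])
print(f"Pearson r = {r:.3f}")
```

On the full frame the same call would be `df["Felony Total"].corr(df["Total"])`.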

In [41]:
alt.Chart(df).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    color="Total",
    tooltip=["County", "Total"]
).properties(
    width=125,
    height=125
).repeat(
    row=["Felony Total", "Drug Felony","Violent Felony", "Other Felony"],
    column=["Felony Total", "Drug Felony","Violent Felony", "Other Felony"]
)

In [40]:
# Build a parallel coordinates plot
alt.Chart(df).transform_window(
    index="count()"
).transform_fold(
    ["Drug Felony","Violent Felony", "Other Felony"]
).mark_line().encode(
    x="key:N",
    y="value:Q",
    detail="index:N",
    opacity=alt.value(0.5),
    color=alt.Color("Felony Total:Q", scale=alt.Scale(scheme="magma")),
    tooltip=["County"]
).properties(width=700).interactive()

In [11]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

# Total adult arrests per year, one line per county
chart = alt.Chart(df).mark_line().encode(
    x='Year:N',
    y=alt.Y('Total', axis=alt.Axis(title='Total adult arrests (felony and misdemeanor)')),
    color=alt.Color('County', legend=None),
    tooltip=['County', 'Total']
).properties(width=650)

In [13]:
chart

In [15]:
brush = alt.selection(type='interval', encodings=['x'])

upper = alt.Chart().mark_line().encode(
    alt.X('Year:N', scale={'domain': brush.ref()}),
    y='Drug Felony:Q',
    color=alt.Color('County', legend=None),
    tooltip=['County', 'Drug Felony']
).properties(
    width=650,
    height=300
)
lower = upper.properties(
    height=150
).add_selection(
    brush
)
alt.vconcat(upper, lower, data=df)

In [16]:
highlight = alt.selection(type='single', on='mouseover',
                          fields=['County'], nearest=True)
base = alt.Chart(df).encode(
    x='Year:N',
    y='Drug Misd:Q',
    color=alt.Color('County:N', legend=None),
    tooltip=['County', 'Drug Misd']
)
points = base.mark_circle().encode(
    opacity=alt.value(0)
).add_selection(
    highlight
).properties(
    width=650
)
lines = base.mark_line().encode(
    size=alt.condition(~highlight, alt.value(1), alt.value(3))
)

points + lines

* *Data Analysis:*

The charts show that New York, Kings, Queens, and Bronx counties recorded the most misdemeanor arrests. Drug misdemeanor arrests stayed low until 1980 and then rose rapidly in New York County through 1990. A brief decline of about three years followed, and arrests climbed rapidly again after 1993. After 2004, arrests declined overall.
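Statements like these can be checked directly against the data frame instead of being read off the charts. The sketch below defines two hypothetical helpers, `top_counties` and `peak_year`, and runs them on a tiny frame of made-up placeholder values (counties "A" and "B" are not real, and the numbers are not arrest counts). It is one possible way to do the check, not the method used to write the paragraph above.

```python
import pandas as pd


def top_counties(frame, column, n=4):
    """Counties with the largest sum of `column` over all years."""
    return frame.groupby("County")[column].sum().nlargest(n)


def peak_year(frame, column):
    """Year in which `column` reaches its maximum, per county."""
    idx = frame.groupby("County")[column].idxmax()
    return frame.loc[idx].set_index("County")["Year"]


# Made-up placeholder values for illustration only, NOT real arrest data.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1991, 1992],
    "Drug Misd": [10, 30, 20, 5, 6, 9],
})

print(top_counties(toy, "Drug Misd", n=1))
print(peak_year(toy, "Drug Misd"))
```

With the real data, `top_counties(df, "Misdemeanor Total")` and `peak_year(df, "Drug Misd")` would give the counties and years to compare against the description.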

* Overall Summary:

Between the 1980s and 1990s, New York, Kings, Queens, and Bronx counties recorded the most arrests, after which arrest counts declined. Altair's interactive features used here (selections, brushing, and hover highlighting) made these patterns straightforward to explore.