# Week 1 - Introduction

By Group XX:

-   Aleksandar Lukic - s194066
-   Paula Barho - s242926
-   Victor Gustav Harbo Rasmussen - s204475

## Part 1: Predictive policing. A case to learn from

Start by reading the article from [sciencemag.org](https://www.sciencemag.org/news/2016/09/can-predictive-policing-prevent-crime-it-happens).

We will be using data from [dataSF](https://datasf.org/opendata/).

- According to the article, is predictive policing better than best practice techniques for law enforcement? The article is from 2016. Take a look around the web: does this still seem to be the case in 2024? (Hint: when you evaluate the evidence, consider the source.)

- List and explain some of the possible issues with predictive policing according to the article.


## Part 2: Load some crime-data into your Jupyter notebook

Using pandas, we will be loading data from local files.

In [111]:
import os
import numpy as np
import pandas as pd

### Preamble for Pandas display options

These options allow the pandas output to be displayed in full, with no truncated columns.

In [112]:
# Set pandas display options to show all columns for .head command
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)        # Auto-detect the display width
pd.set_option('display.max_colwidth', None) # Show full content of each column

### Get data from .csv

Get the datasets from the data folder in the repository:

In [None]:
# Get path of data directory
data_path = os.path.abspath(os.path.join(os.pardir, "data"))
data_path

In [114]:
# Load data from csv files
csv_1_name = "Police_Department_Incident_Reports__Historical_2003_to_May_2018_20250210.csv"
csv_2_name = "Police_Department_Incident_Reports__2018_to_Present_20250210.csv"
csv_1_path = os.path.join(data_path, csv_1_name)
csv_2_path = os.path.join(data_path, csv_2_name)
df_1 = pd.read_csv(csv_1_path)
df_2 = pd.read_csv(csv_2_path)

In [None]:
# Display shape of dataframes
print("df_1:", df_1.shape)
print("df_2:", df_2.shape)

### Examine the datasets

To concatenate the two datasets, we must ensure that they have the same columns and follow the same naming and type conventions.

In [None]:
df_1.info()

In [None]:
df_1.head()

In [None]:
df_2.info()

In [None]:
df_2.head()
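
As a complement to `info()` and `head()`, the column names and dtypes of two DataFrames can also be compared programmatically. The sketch below uses two tiny made-up frames (`old` and `new` are hypothetical stand-ins for `df_1` and `df_2`), and the helper name `compare_schemas` is our own, not part of pandas.

```python
import pandas as pd

def compare_schemas(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Summarise which columns are shared and which exist on one side only."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    shared = sorted(left_cols & right_cols)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
        "shared": shared,
        # Shared columns whose dtypes differ between the two frames
        "dtype_mismatch": [c for c in shared if left[c].dtype != right[c].dtype],
    }

# Hypothetical miniature versions of the two datasets
old = pd.DataFrame({"Category": ["ASSAULT"], "X": [-122.4], "Date": ["01/31/2005"]})
new = pd.DataFrame({"Incident Category": ["Assault"], "Date": [pd.Timestamp("2020-01-31")]})

report = compare_schemas(old, new)
print(report)
```

Columns listed under `only_left` or `only_right` are candidates for renaming or dropping, and anything under `dtype_mismatch` needs a type conversion before the frames can be stacked cleanly.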

### Shrinking the data

The datasets are very large and contain information that we do not currently need. We therefore keep only the useful columns and discard the rest.

In [120]:
# Columns to keep for df_1
columns_to_keep_1 = [
    'Date',
    'Time',
    'Category',
    'DayOfWeek',
    'X',
    'Y',
    'PdDistrict'
]

In [121]:
# Columns to keep for df_2
columns_to_keep_2 = [
    'Incident Date',
    'Incident Time',
    'Incident Category',
    'Incident Day of Week',
    'Latitude',
    'Longitude',
    'Police District'
]

In [122]:
# Extract only the columns specified for keeping
df_1 = df_1[columns_to_keep_1]
df_2 = df_2[columns_to_keep_2]

In [None]:
df_1.head()

In [None]:
df_2.head()

### Rename the columns

First, we rename the columns so that the two datasets can be joined later on:

In [125]:
df_1 = df_1.rename(columns={
    'Date': 'Date',
    'Time': 'Time',
    'Category': 'Category',
    'DayOfWeek': 'Day of Week',
    "X": "Longitude (X)",
    "Y": "Latitude (Y)",
    'PdDistrict': 'Police District'
    }
)

df_2 = df_2.rename(columns={
    'Incident Date': 'Date',
    'Incident Time': 'Time',
    'Incident Category': 'Category',
    'Incident Day of Week': 'Day of Week',
    'Longitude': 'Longitude (X)',
    'Latitude': 'Latitude (Y)',
    'Police District': 'Police District'
    }
)

### Align date and time formats

The two datasets follow different date and time conventions, so it is necessary to align them to a single one.

In [None]:
df_1.head(1)

In [None]:
df_2.head(1)

In [128]:
# Parsing the date and time
df_1_time = df_1.copy()  # explicit copy, so df_1 itself is left unchanged

# Convert "Date" column to datetime format
df_1_time["Date"] = pd.to_datetime(df_1_time["Date"], format="%m/%d/%Y")

# Create new columns from "Date"
df_1_time["Day"] = df_1_time["Date"].dt.day
df_1_time["Month"] = df_1_time["Date"].dt.strftime("%B")  # Month name
df_1_time["Year"] = df_1_time["Date"].dt.year

# Extract the hour from the "Time" column to create "Hour"
df_1_time["Hour"] = pd.to_datetime(df_1_time["Time"], format="%H:%M").dt.hour

# Drop the original "Date" and "Time" columns
df_1_time = df_1_time.drop(columns=["Date", "Time"])

In [129]:
# Parsing the date and time
df_2_time = df_2.copy()  # explicit copy, so df_2 itself is left unchanged

# Convert "Date" column to datetime format
df_2_time["Date"] = pd.to_datetime(df_2_time["Date"], format="%Y/%m/%d")

# Create new columns from "Date"
df_2_time["Day"] = df_2_time["Date"].dt.day
df_2_time["Month"] = df_2_time["Date"].dt.strftime("%B")  # Month name
df_2_time["Year"] = df_2_time["Date"].dt.year

# Extract the hour from the "Time" column to create "Hour"
df_2_time["Hour"] = pd.to_datetime(df_2_time["Time"], format="%H:%M").dt.hour

# Drop the original "Date" and "Time" columns
df_2_time = df_2_time.drop(columns=["Date", "Time"])

In [None]:
df_1_time.head(1)

In [None]:
df_2_time.head(1)

In [132]:
# Override the original dataframes with the new ones
df_1 = df_1_time
df_2 = df_2_time
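
The two parsing cells above differ only in the date format string, so the logic could be factored into a single helper. This is a minimal sketch, assuming the column names `Date` and `Time` used above; the function name `split_date_time` and the sample rows are hypothetical.

```python
import pandas as pd

def split_date_time(df: pd.DataFrame, date_format: str) -> pd.DataFrame:
    """Return a copy with Day, Month, Year and Hour replacing Date and Time."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    dates = pd.to_datetime(out["Date"], format=date_format)
    out["Day"] = dates.dt.day
    out["Month"] = dates.dt.strftime("%B")  # month name
    out["Year"] = dates.dt.year
    out["Hour"] = pd.to_datetime(out["Time"], format="%H:%M").dt.hour
    return out.drop(columns=["Date", "Time"])

# Hypothetical one-row samples in the two date conventions
old_style = pd.DataFrame({"Date": ["02/10/2004"], "Time": ["23:15"]})
new_style = pd.DataFrame({"Date": ["2020/02/10"], "Time": ["07:45"]})

old_parts = split_date_time(old_style, "%m/%d/%Y")
new_parts = split_date_time(new_style, "%Y/%m/%d")
print(old_parts)
print(new_parts)
```

Both calls return frames with the same four columns, which is what the later concatenation relies on.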

### Align Category and Police District columns

Both datasets store lookup values in the Category and Police District columns. However, the values are not formatted identically and will therefore not compare as equal. We need to align them manually.

In [None]:
print(df_1['Category'].unique())

In [None]:
print(df_2['Category'].unique())

Since the categories in the second dataset are capitalized differently from the uppercase names in the first, we convert them to uppercase.

In [135]:
df_2['Category'] = df_2['Category'].str.upper()

In [None]:
print(df_2['Category'].unique())

In [None]:
categories = set(np.concatenate((
    df_1['Category'].unique(), 
    df_2['Category'].unique()
    ), axis=0
))

print("No. of incident categories:", len(categories))

In [None]:
categories
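
To see which names need aligning, the categories can be split into those shared by both datasets and those unique to either one. The sketch below uses small made-up sets; in the notebook the real values would come from the `unique()` calls above.

```python
# Hypothetical category values standing in for df_1 and df_2
old_categories = {"LARCENY/THEFT", "VEHICLE THEFT", "ASSAULT"}
new_categories = {"LARCENY THEFT", "MOTOR VEHICLE THEFT", "ASSAULT"}

shared = old_categories & new_categories    # already aligned
only_old = old_categories - new_categories  # no counterpart yet in the new data
only_new = new_categories - old_categories  # candidates for a manual mapping

print("Shared:", sorted(shared))
print("Only in old:", sorted(only_old))
print("Only in new:", sorted(only_new))
```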

Since some category names still differ between the two datasets, we map the names in the second dataset onto those used in the first.

In [139]:
category_mapping = {
    'DRUG VIOLATION': 'DRUG/NARCOTIC',
    'DRUG OFFENSE': 'DRUG/NARCOTIC',
    'LARCENY THEFT': 'LARCENY/THEFT',
    'MALICIOUS MISCHIEF': 'VANDALISM',  
    'MOTOR VEHICLE THEFT': 'VEHICLE THEFT',
    'MOTOR VEHICLE THEFT?': 'VEHICLE THEFT',
    'WEAPONS CARRYING ETC': 'WEAPON LAWS',
    'WEAPONS OFFENCE': 'WEAPON LAWS',
    'WEAPONS OFFENSE': 'WEAPON LAWS',
    
    # Additional mappings for edge cases
    'TRAFFIC VIOLATION ARREST': 'DRIVING UNDER THE INFLUENCE',  # If DUI is included here 
    'SUSPICIOUS OCC': 'TRESPASS',
    'SUSPICIOUS': 'TRESPASS',
    'LIQUOR LAWS': 'DRUNKENNESS'  # If liquor law violations include public drunkenness 
}

# Replace the categories in the dataframes with the new mappings
df_2['Category'] = df_2['Category'].replace(category_mapping)
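
One property of `Series.replace` with a dictionary is worth keeping in mind: values without an entry in the mapping are left unchanged, so it can be useful to check what still falls outside the target names afterwards. A small sketch with made-up values (`mapping` and `target` here are reduced, hypothetical versions of the real ones):

```python
import pandas as pd

# Hypothetical category values and a reduced mapping
categories = pd.Series(["LARCENY THEFT", "ASSAULT", "FRAUD", "WEAPONS OFFENSE"])
mapping = {"LARCENY THEFT": "LARCENY/THEFT", "WEAPONS OFFENSE": "WEAPON LAWS"}
target = {"LARCENY/THEFT", "ASSAULT", "WEAPON LAWS"}

renamed = categories.replace(mapping)      # unmapped values stay as they are
unmatched = sorted(set(renamed) - target)  # names still outside the target set

print(renamed.tolist())
print("Unmatched:", unmatched)
```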

We now keep only the rows containing the focus crimes defined below:

In [140]:
focus_crimes = [
    'ASSAULT',
    'BURGLARY',
    'DISORDERLY CONDUCT',
    'DRIVING UNDER THE INFLUENCE',
    'DRUG/NARCOTIC',
    'DRUNKENNESS',
    'LARCENY/THEFT',
    'PROSTITUTION',
    'ROBBERY',
    'STOLEN PROPERTY',
    'TRESPASS',
    'VANDALISM',
    'VEHICLE THEFT',
    'WEAPON LAWS'
]

In [141]:
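# Extract only the rows where the "Category" is in the focus_crimes list.
# .copy() makes the results independent frames, so later column assignments
# do not raise a SettingWithCopyWarning.
df_1_filtered = df_1[df_1['Category'].isin(focus_crimes)].copy()
df_2_filtered = df_2[df_2['Category'].isin(focus_crimes)].copy()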

In [None]:
print("df_1:", df_1.shape)
print("df_2:", df_2.shape)
print("df_1_filtered:", df_1_filtered.shape)
print("df_2_filtered:", df_2_filtered.shape)

In [143]:
df_1 = df_1_filtered
df_2 = df_2_filtered

We can now do the same for the Police District column.

In [None]:
print(df_1['Police District'].unique())

In [None]:
print(df_2['Police District'].unique())

In [None]:
df_2['Police District'] = df_2['Police District'].str.upper()

In [None]:
police_districts = set(np.concatenate((
    df_1['Police District'].unique(), 
    df_2['Police District'].unique()
    ), axis=0
))

print("No. of Police Districts:", len(police_districts))

In [None]:
police_districts

### Merging the two datasets into one

Finally, we merge the two datasets by aligning their columns and concatenating them.

In [149]:
# Ensure that both DataFrames have the same columns in the same order
columns = [
    'Category', 
    'Police District', 
    'Longitude (X)', 
    'Latitude (Y)',  
    'Day of Week',
    'Hour', 
    'Day', 
    'Month', 
    'Year'
]

df_1 = df_1[columns]
df_2 = df_2[columns]

In [150]:
df_merged = pd.concat([df_1, df_2], axis=0, ignore_index=True)
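
As a toy illustration of what this call does, the sketch below stacks two small made-up frames. With `ignore_index=True` the original row labels are discarded and the result is numbered from 0 again, which avoids duplicate index values.

```python
import pandas as pd

# Two hypothetical frames with identical columns, each indexed from 0
a = pd.DataFrame({"Category": ["ASSAULT", "ROBBERY"], "Year": [2005, 2006]})
b = pd.DataFrame({"Category": ["BURGLARY"], "Year": [2020]})

stacked = pd.concat([a, b], axis=0, ignore_index=True)
print(stacked)
```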

In [None]:
df_merged.head()

In [152]:
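# Sort the data chronologically by Year, Month, Day, and Hour.
# "Month" holds month names, which would otherwise sort alphabetically,
# so convert it to an ordered categorical in calendar order first.
import calendar
df_merged['Month'] = pd.Categorical(
    df_merged['Month'], categories=list(calendar.month_name)[1:], ordered=True
)
df_sorted = df_merged.sort_values(
    by=['Year', 'Month', 'Day', 'Hour'],
    na_position='last'
)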

In [153]:
# Reset the index of the sorted DataFrame, discarding the old index
df_reindexed = df_sorted.reset_index(drop=True)


In [None]:
print("df_reindexed:", df_reindexed.shape)

In [None]:
df_reindexed.head()

In [None]:
df_reindexed.tail()

In [157]:
# Save the cleaned and merged data to a new csv file
cleaned_data_path = os.path.join(data_path, "Police_Department_Incident_Reports_Complete.csv")
df_reindexed.to_csv(cleaned_data_path, index=False)

### Simple statistics

Now generate the following simple statistics:
- Report the total number of crimes in the dataset.
- List the various categories of crime. How many are there?
- List the number of crimes in each category.

In [None]:
# Printing total number of crimes
total_crimes = df_reindexed.shape[0]
print(f"Total number of crimes: {total_crimes}")

# List the crime categories and the number of crimes in each
category_counts = df_reindexed['Category'].value_counts()
print(f"Number of categories: {category_counts.size}")
print("Number of crimes in each category:")
print(category_counts)

### Exercise 2: The types of crimes.

- We have already counted the number of crimes in each category. What is the most commonly occurring category of crime? What is the least frequently occurring?

The most commonly occurring category of crime is larceny/theft and the least frequently occurring one is drunkenness.

- Did you run into categories changing across your two data periods? If yes, think about how to deal with those issues. There's no right answer but reflect on your decisions. (And don't spend too much energy on this, since we'll only be working on a subset of the crimes long-term, see Focus Crimes below.)

Yes: few categories overlap, the naming differs between the two periods, and the number of reported crimes increases drastically. We kept only the categories that are shared across both datasets.

- Create a bar-plot over crime occurrences. 

In [None]:
import matplotlib.pyplot as plt

# Create a bar plot
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar')

# Add title and labels
plt.title('Crime Occurrences by Category')
plt.xlabel('Category')
plt.ylabel('Number of Crimes')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Show the plot
plt.tight_layout()
plt.show()

### Exercise 3: Temporal patterns.

- What is the year with most crimes?
- What is the year with the fewest crimes?
- Create a barplot of crimes-per-year (years on the x-axis, crime-counts on the y-axis).
- Finally, Police chief Suneman is interested in the temporal development of only a **subset of categories, the so-called *focus crimes***. Those categories are listed below (for convenient copy-paste action). Create bar-charts displaying the year-by-year development of each of these categories across the years 2003-2017.

In [None]:
# Count the number of crimes per year
yearly_crime_counts = df_reindexed['Year'].value_counts()

# Find the year with the most crimes
year_most_crimes = yearly_crime_counts.idxmax()
most_crimes = yearly_crime_counts.max()

# Find the year with the fewest crimes
year_fewest_crimes = yearly_crime_counts.idxmin()
fewest_crimes = yearly_crime_counts.min()

print(f"The year with the most crimes is {year_most_crimes} with {most_crimes} crimes.")
print(f"The year with the fewest crimes is {year_fewest_crimes} with {fewest_crimes} crimes.")
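
The remaining bullets ask for crimes per year and for the yearly development of each focus crime. One possible way to prepare those counts is sketched below on a few made-up rows (`toy` is a hypothetical stand-in for `df_reindexed`, and only two categories are shown).

```python
import pandas as pd

# Hypothetical miniature of the merged dataset
toy = pd.DataFrame({
    "Year": [2003, 2003, 2004, 2004, 2004, 2018],
    "Category": ["ASSAULT", "ROBBERY", "ASSAULT", "ASSAULT", "ROBBERY", "ASSAULT"],
})

# Crimes per year, ordered by year rather than by count
per_year = toy["Year"].value_counts().sort_index()

# Year x Category table of counts, restricted to 2003-2017
in_range = toy[toy["Year"].between(2003, 2017)]
per_year_and_category = pd.crosstab(in_range["Year"], in_range["Category"])

print(per_year)
print(per_year_and_category)
```

Calling `per_year.plot(kind='bar')` or `per_year_and_category.plot(kind='bar', subplots=True)` on tables like these would then draw the bar charts with matplotlib, one panel per category in the second case.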