## Introduction

For our project, we were tasked with using datasets from the Western Pennsylvania Regional Data Center (WPRDC) to determine a "best neighborhood" using a metric of our choosing. After discussing, we decided to focus on the safety of each neighborhood, as we all agreed that it is important to live in a safe neighborhood. We considered other metrics, such as quality of education and access to food, but settled on safety as our main metric because many of the datasets on the WPRDC are directly related to safety.

## The Metrics

### Metric 1: Arrest Rates

To measure safety, one metric we can look at is arrest data and how frequent arrests are in each neighborhood. Generally, a neighborhood with a lower arrest rate would be safer. From the Western Pennsylvania Regional Data Center, we can use both the arrest dataset provided by the Pittsburgh police and the 2020 census dataset, which gives the neighborhood populations.

To measure this metric fairly, we need both the arrest data and the population data so that we can look at the rate of arrests rather than just the raw arrest numbers. That way, neighborhoods can be compared on a relative basis, and we won't have a scenario where the best neighborhoods are simply the smallest and the worst ones are simply the largest, as they would have lower and higher raw arrest counts respectively.
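To make the difference between raw counts and rates concrete, here is a minimal sketch with made-up numbers. The neighborhood names and figures are hypothetical and are not taken from the WPRDC data; the point is only that the two rankings can disagree.

```python
# Hypothetical figures for illustration only; not real WPRDC data.
toy_arrests = {"Bigville": 500, "Smalltown": 50}
toy_population = {"Bigville": 10000, "Smalltown": 500}

# Arrests per resident for each toy neighborhood.
toy_rates = {name: toy_arrests[name] / toy_population[name] for name in toy_arrests}

# Rank once by raw count and once by rate.
fewest_arrests = min(toy_arrests, key=toy_arrests.get)
lowest_rate = min(toy_rates, key=toy_rates.get)

print(fewest_arrests)  # the smaller neighborhood has fewer arrests overall
print(lowest_rate)     # the larger one has fewer arrests per resident
```

In this toy case the smaller neighborhood wins on raw counts while the larger one wins on rate, which is the distortion the per-capita measure is meant to remove.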

First, we will import all the necessary libraries in order to read, manipulate and display the data. Then we will read both of the datasets into corresponding data frames.

In [None]:
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt

# read in arrest data
arr = pd.read_csv("ArrestDataset.csv", sep=",")
#read in population data
pop = pd.read_csv("NeighborhoodPopulation.csv", sep=",")

Next, we will create two dictionaries: one mapping each neighborhood to its population and one mapping each neighborhood to its number of offenses. For each of them, we will iterate through the corresponding dataframe and populate the dictionary with the neighborhood as the key. The value for population is the 2020 total population for that neighborhood, and for arrests we increment that neighborhood's count by one for each arrest record.



In [None]:
#define dict for populations
population = {}
#iterate and assign population to each neighborhood
for index,row in pop.iterrows():
    population[row["Neighborhood"]] = row["2020_Total_Population"]
    
#define dict for offenses
offenses =  defaultdict(int)
#iterate and assign # of offenses to each neighborhood
for index,row in arr.iterrows():
    offenses[row["INCIDENTNEIGHBORHOOD"]] += 1
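As an aside, the same two lookups could probably be built without explicit loops. The sketch below uses tiny invented dataframes that reuse the column names from the cells above; the `toy_` names and rows are assumptions for illustration, not part of our analysis.

```python
import pandas as pd

# Tiny stand-in frames using the same column names as the cells above;
# the rows are invented for illustration.
toy_pop = pd.DataFrame(
    {"Neighborhood": ["A", "B"], "2020_Total_Population": [1000, 250]}
)
toy_arr = pd.DataFrame({"INCIDENTNEIGHBORHOOD": ["A", "B", "A", None]})

# Neighborhood -> population, without an explicit loop.
toy_population = toy_pop.set_index("Neighborhood")["2020_Total_Population"].to_dict()

# Neighborhood -> number of arrests; value_counts skips missing values by default.
toy_offenses = toy_arr["INCIDENTNEIGHBORHOOD"].value_counts().to_dict()

print(toy_population)
print(toy_offenses)
```

One behavioral difference to keep in mind: `value_counts` omits rows with a missing neighborhood by default, whereas the loop above would record them under missing-value keys.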

Next, we will define the set of neighborhoods that appear in our population dataset but not in our arrest dataset. This ensures that we only look at neighborhoods for which we have both a well-defined population and arrest records, and that we don't try to access something that doesn't exist in the arrest dataset.

In [None]:
#create sets of neighborhoods
arr_neighborhoods = set(arr['INCIDENTNEIGHBORHOOD'])
pop_neighborhoods = set(pop['Neighborhood'])
#get missing neighborhoods by set difference
missing_neighborhoods = pop_neighborhoods - arr_neighborhoods

Now that we have fully prepared our two datasets, we can create a crime rate dictionary that maps each neighborhood to its number of arrests relative to its population. We will then sort that into a list.

In [None]:
#make crimerate dict
crimerate = {}
#iterate through population
for key in population:
    #make sure we are only accessing neighborhoods in both datasets
    if key not in (missing_neighborhoods):
        #calculate crime rate for each neighborhood
        crimerate[key] = offenses[key] / population[key]

#sort by crimerate
sorted_crimerate = sorted(crimerate.items(), key=lambda x: x[1])
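If any neighborhood in the population file had a population of zero, the division above would not be meaningful. We have not checked whether that occurs in our data, so the following is only a sketch of one way to guard against it, using invented values and a hypothetical helper named `safe_rate`.

```python
def safe_rate(arrest_count, residents):
    """Return arrests per resident, or None when the population is not positive."""
    if residents <= 0:
        return None
    return arrest_count / residents


# Invented values for illustration only.
toy_counts = {"A": 20, "B": 10, "C": 3}
toy_residents = {"A": 1000, "B": 250, "C": 0}

toy_crimerate = {}
for name, residents in toy_residents.items():
    rate = safe_rate(toy_counts.get(name, 0), residents)
    # Skip neighborhoods where no rate can be computed.
    if rate is not None:
        toy_crimerate[name] = rate

# Sort from lowest to highest rate, as in the cell above.
toy_sorted = sorted(toy_crimerate.items(), key=lambda item: item[1])
print(toy_sorted)
```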

Looking at the crime rate of each neighborhood, we want to make sure that we're getting values that make sense. The only values that don't make sense are the highest arrest rates, which are much higher than 1 arrest per person and therefore unrealistic. This could be due to a variety of reasons: the recency of the arrest data, the way neighborhood regions are divided between datasets, small populations in some neighborhoods, or more. Luckily, this is only true of the neighborhoods with the highest arrest rates, which we are not interested in. We will avoid the issue by examining only the top contenders later.

In [None]:
for key, value in sorted_crimerate:
    print(f"Neighborhood: {key}, Crime Rate: {value}", "pop: ", population[key])
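Rather than scanning the printed list by eye, the implausible entries could also be separated out explicitly by filtering for rates above 1. This is a minimal sketch on invented pairs that stand in for `sorted_crimerate`; the `toy_` names and values are assumptions.

```python
# Invented (neighborhood, rate) pairs standing in for sorted_crimerate.
toy_sorted_rates = [("A", 0.02), ("B", 0.15), ("C", 3.4)]

# A rate above 1 means more arrests than residents, which we treat as an outlier.
toy_outliers = [(name, rate) for name, rate in toy_sorted_rates if rate > 1]
toy_plausible = [(name, rate) for name, rate in toy_sorted_rates if rate <= 1]

print("outliers:", toy_outliers)
print("plausible:", toy_plausible)
```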

Now, we will create a dataframe from our sorted crime rate list, with one column for the neighborhood name and one for the crime rate we just calculated. At the same time, we will make another dataframe of the 10 neighborhoods with the lowest crime rate for easier visualization.

In [None]:
#make crimerate dataframe
crimerate_df = pd.DataFrame(sorted_crimerate, columns=['Neighborhood', 'Crime Rate'])
#make top 10 neighborhoods by crimerate dataframe
top_10_neighborhoods = crimerate_df.sort_values(by='Crime Rate', ascending=True).head(10)
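For reference, pandas also provides `DataFrame.nsmallest`, which selects the rows with the lowest values in a column in one step. The sketch below applies it to an invented four-row frame with the same column names; the data are made up.

```python
import pandas as pd

# Invented rows with the same column names as crimerate_df.
toy_df = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Crime Rate": [0.30, 0.05, 0.12, 0.01],
    }
)

# The two rows with the lowest crime rate, ordered from lowest to highest.
toy_top_2 = toy_df.nsmallest(2, "Crime Rate")
print(toy_top_2)
```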

Now we will visualize the results of our data exploration using matplotlib. Looking at all of the neighborhoods at once is overwhelming, as the outlier data mentioned above makes the rest of the data hard to see, so we can cut off the last 30 neighborhoods to get a feel for the bulk of the data and read those neighborhoods' names more easily.

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(crimerate_df['Neighborhood'], crimerate_df['Crime Rate'], color='skyblue')
plt.xlabel('Neighborhood')
plt.ylabel('Crime Rate')
plt.title('Crime Rate by Neighborhood in Pittsburgh')
plt.xticks(rotation=45, ha='right')
plt.show()

Cutting off the 30 neighborhoods with the highest rates lets us see that most neighborhoods have an arrest rate below 30%.

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(crimerate_df.head(59)['Neighborhood'], crimerate_df.head(59)['Crime Rate'], color='skyblue')
plt.xlabel('Neighborhood')
plt.ylabel('Crime Rate')
plt.title('Crime Rate by Neighborhood in Pittsburgh (Highest-Rate Neighborhoods Removed)')
plt.xticks(rotation=45, ha='right')
plt.show()

Now let's look at just the top 10 neighborhoods.


In [5]:
plt.figure(figsize=(10, 6))
plt.bar(top_10_neighborhoods['Neighborhood'], top_10_neighborhoods['Crime Rate'], color='skyblue')
plt.xlabel('Neighborhood')
plt.ylabel('Crime Rate')
plt.title('Top 10 Neighborhoods by Lowest Crime Rate in Pittsburgh')
plt.xticks(rotation=45, ha='right')
plt.show()


We can see from the bar graph that the neighborhood with the fewest arrests relative to its population is Central Northside.