# Initial setup

Installing the required modules

In [None]:
!pip install ipykernel
!pip install plotly
!pip install --upgrade nbformat
!pip install numpy
!pip install pandas
!pip install scikit-learn

Importing the modules used throughout the notebook

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

The `state_name_to_code` map is used to encode state names (wherever present) into their 2-letter state abbreviations.

In [2]:
state_name_to_code = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}
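
As a minimal sketch of how such a map might be applied, the snippet below encodes a column of state names with `Series.map`. The small `name_to_code` dictionary and the sample rows are made up for illustration; in the notebook, `state_name_to_code` would take their place.

```python
import pandas as pd

# Hypothetical subset of the mapping, for illustration only
name_to_code = {'Alabama': 'AL', 'Texas': 'TX', 'New York': 'NY'}

# Made-up rows; 'Guam' is deliberately absent from the mapping
df = pd.DataFrame({'state': ['Texas', 'New York', 'Alabama', 'Guam']})

# Series.map looks each name up in the dict; names without an entry become NaN
df['state_code'] = df['state'].map(name_to_code)

# Collect the names that could not be encoded so they can be inspected
unmapped = df.loc[df['state_code'].isna(), 'state'].tolist()
print(df)
print('Unmapped names:', unmapped)
```

Checking for unmapped names is useful because territories or misspelled names would otherwise silently turn into missing values.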

# Analysis of the number of cases in each state

One of the most basic geographical analyses we can run on this dataset is counting the number of police shootings recorded per state. This gives us a broad idea of how cases are distributed across the US and lets us identify areas with more cases, if any emerge from the visualization.

In [None]:
df_shootings = pd.read_csv('../datasets/police_shootings_cleaned.csv')

In [4]:
df_shootings_by_state = df_shootings.groupby('state').size().reset_index(name='count')

In [5]:
fig = px.scatter_geo(df_shootings_by_state, locations='state', locationmode='USA-states', hover_name='state', 
        size='count', projection='albers usa', title='Police Shootings by State', width=1000, height=800)
fig.show()

From this scatterplot, the number of shootings in each state appears to be roughly proportional to the state's population. We can analyze the population of each state to check this observation.

# Analysis of the number of shootings versus the population of the state

The dataset we use in this section is a dataset of state population data, where each entry has `2020_census` and `percent_of_total` attributes that respectively correspond to the population of the US state as estimated by the 2020 census and the percentage of the total US population resident in the state.

A basic way to check for a correlation between state population (or rather, the percentage of the total population, which is equivalent since the two values are proportional) and the number of shootings is to plot bar graphs of both, one above the other, and compare them.

In [None]:
df_state_populations = pd.read_csv('../datasets/us_pop_by_state.csv')

In [7]:
df_state_populations = df_state_populations.sort_values('state')
df_shootings_by_state = df_shootings_by_state.sort_values('state')

In [81]:
df_shootings_by_state = df_shootings_by_state[df_shootings_by_state['state'] != 'DC']

In [83]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=1, subplot_titles=('Population Fraction by State', 'Police Shootings by State'))

fig.add_trace(go.Bar(x=df_state_populations['state'], y=df_state_populations['percent_of_total'], name='Population'), row=1, col=1)
fig.add_trace(go.Bar(x=df_shootings_by_state['state'], y=df_shootings_by_state['count'], name='Shootings'), row=2, col=1)

fig.show()

As we can see from the bar graphs, the states' shares of the national population and their numbers of recorded police shootings follow fairly similar patterns. However, the two are not entirely consistent; to study this in more detail, we will cluster the states.
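
One way to put a number on this visual comparison is the Pearson correlation between the two columns. The sketch below uses made-up values and hypothetical state codes rather than the real tables, so only the technique carries over, not the figures.

```python
import pandas as pd

# Made-up illustrative values, not taken from the real datasets
toy = pd.DataFrame({
    'state': ['AA', 'BB', 'CC', 'DD', 'EE'],
    'percent_of_total': [12.0, 9.0, 6.0, 2.0, 1.0],
    'count': [1100, 700, 300, 250, 60],
})

# Pearson correlation: values close to 1 mean the counts rise
# roughly linearly with the population share
r = toy['percent_of_total'].corr(toy['count'])

# Shootings per percentage point of population; states at the top or bottom
# of this ranking are the ones that break the proportional pattern
toy['per_point'] = toy['count'] / toy['percent_of_total']
ranked = toy.sort_values('per_point', ascending=False)

print(round(r, 3))
print(ranked[['state', 'per_point']])
```

A high correlation alone does not show which states deviate from the trend; the per-point ratio is what singles them out.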

# Clustering states by fraction of population and police shootings in the state

## Training the clustering model

Here, we attempt to cluster states into 4 categories based on the number of police shootings in the state and the fraction of the national population resident in the state.

In [75]:
from sklearn.cluster import KMeans

X = pd.DataFrame({
    'state': df_state_populations['state'],
    'percent_of_total': np.zeros(len(df_state_populations['state'])),
    'count': np.zeros(len(df_state_populations['state']))
})
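for i in range(len(X['state'])):
    state = X['state'][i]
    # Pick the single matching value with .iloc[0]; assigning a one-element Series is deprecated
    X.at[i, 'percent_of_total'] = df_state_populations[df_state_populations['state'] == state]['percent_of_total'].iloc[0]
    X.at[i, 'count'] = df_shootings_by_state[df_shootings_by_state['state'] == state]['count'].iloc[0]

# Keep an independent copy of the unscaled values for later use
X_orig = X.copy()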

X['percent_of_total'] = (X['percent_of_total'] - X['percent_of_total'].min()) / (X['percent_of_total'].max() - X['percent_of_total'].min())
X['count'] = (X['count'] - X['count'].min()) / (X['count'].max() - X['count'].min())

kmm = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmm.fit(X.drop('state', axis=1))
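
The min-max scaling applied to both features above can be factored into a small helper. This is only a sketch: `min_max_scale` is a hypothetical name that does not appear in the notebook, and the sample values are made up.

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Linearly rescale a numeric Series to the range [0, 1]."""
    span = series.max() - series.min()
    if span == 0:
        # All values are identical: avoid dividing by zero and map everything to 0
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

# Made-up counts for illustration
counts = pd.Series([60, 250, 300, 700, 1100])
scaled = min_max_scale(counts)
print(scaled.tolist())
```

Scaling both features to the same range matters here because k-means relies on Euclidean distances, so an unscaled feature with larger magnitudes would dominate the clustering.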


## Visualizing the clustering results

We will plot the states by the cluster assigned to them in a scatterplot, with the X-axis as the (scaled) fraction of the national population resident in the state and the Y-axis as the (scaled) number of police shootings in the state, and see if we can gain any insights from the clustering.

In [78]:
clusters = kmm.predict(X.drop('state', axis=1))

fig = px.scatter(X, x='percent_of_total', y='count', color=clusters, width=800, height=800, labels={'percent_of_total': 'Population Fraction (Scaled)', 'count': 'Number of Police Shootings (Scaled)'}, 
        title='K-Means Clustering of States by Population and Police Shootings', hover_name='state')
fig.show()

The plot above yields few meaningful insights about the states: with only two strongly correlated features, most states are simply grouped by population, since the number of shootings rises almost proportionally with it.

However, the plot also shows that almost all points lie close to the diagonal, which therefore acts as an approximate first principal component. We can thus perform another clustering, this time after rotating the data onto the diagonal and its perpendicular, a fixed 45° rotation that stands in for a fitted principal component analysis.
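
The idea that the diagonal acts as a principal component can be checked numerically. The sketch below generates synthetic points scattered around the diagonal (not the notebook's data), computes the first principal axis from the covariance matrix, and measures how closely it lines up with the direction `(1, 1) / sqrt(2)`.

```python
import numpy as np

# Synthetic points near the diagonal; a stand-in for the scaled features
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=50)
points = np.column_stack([t, t + rng.normal(0, 0.05, size=50)])

# Principal axes are the eigenvectors of the covariance matrix
centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# The first principal component belongs to the largest eigenvalue
first_pc = eigvecs[:, np.argmax(eigvals)]

# Compare with the unit vector along the diagonal; the absolute value is
# taken because the sign of an eigenvector is arbitrary
diagonal = np.array([1.0, 1.0]) / np.sqrt(2)
alignment = abs(first_pc @ diagonal)
print(round(alignment, 4))
```

An alignment close to 1 means that rotating by 45° gives nearly the same axes as a fitted PCA; on the real data the value would need to be computed before relying on the shortcut.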

## Clustering using principal component analysis (PCA)

In [11]:
X_pca = X.copy()
# Rotate by 45 degrees: x runs along the diagonal, y is perpendicular to it
X_pca['x'] = (X['count'] + X['percent_of_total']) / np.sqrt(2)
X_pca['y'] = (X['count'] - X['percent_of_total']) / np.sqrt(2)
# Rescale x to [0, 1] and y to [-0.5, 0.5]
X_pca['x'] = (X_pca['x'] - X_pca['x'].min()) / (X_pca['x'].max() - X_pca['x'].min())
X_pca['y'] = (X_pca['y'] - X_pca['y'].min()) / (X_pca['y'].max() - X_pca['y'].min()) - 0.5

kmm = KMeans(n_clusters=4, init='k-means++', random_state=42)
# Cluster on the rotated coordinates only
kmm.fit(X_pca[['x', 'y']])

## Visualizing the clustering (with PCA) results

In [12]:
clusters_pca = kmm.predict(X_pca[['x', 'y']])

fig = px.scatter(X_pca, x='x', y='y', color=clusters_pca, width=800, height=800, 
        title='K-Means Clustering of States by Population and Police Shootings', hover_name='state')
fig.show()

We can see here that CA, TX and NY are outliers (CA and TX because of their high populations, NY because of its low number of shootings relative to its population); the other 47 states form two clusters that are divided almost evenly by the line `y = 0.13`.

The purple states (`cluster = 1`) represent states with higher shootings per capita and the blue states (`cluster = 0`) represent states with relatively lower shootings per capita.

Also, extending our analysis in A1, states with "higher density" (AZ, CO, NM, OK) are all in the 1st cluster, whereas states with "lower density" (CT, MA, NJ, NY) are all in the 0th cluster (bar NY, which has such a low case density that it is an outlier). Hence, our hypothesis in A1 is also consistent with this clustering.

# Correlation of case density of the state with gun laws in the state

As can be seen in the dataset, most shootings occur when the victim is armed with a firearm. It is therefore useful to compare the number of police shootings per capita in each state with the laxness of its gun laws, and look for a correlation between them.

The expected outcome is that, since criminals armed with firearms are more prevalent in states with laxer gun laws, there should be a positive correlation between the laxness of the laws in the state and the number of shooting cases per capita.

The dataset we use here lists 134 gun-control provisions and records, for each state, whether the state had each provision in place as of 2019. The final total is the number of provisions the state upholds. This is the **strictness score**; the **laxness score** can be calculated simply as `134 - strictness_score`.
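
The relationship between the two scores can be sketched on a miniature table. The layout below (one 0/1 column per provision, hypothetical column names and state codes) is an assumption about how such a dataset might look, not a description of the actual file.

```python
import pandas as pd

# Hypothetical miniature of the table: one 0/1 column per provision
laws = pd.DataFrame({
    'state': ['AA', 'BB', 'CC'],
    'provision_1': [1, 0, 1],
    'provision_2': [1, 0, 0],
    'provision_3': [1, 1, 0],
})

provision_cols = [c for c in laws.columns if c != 'state']
n_provisions = len(provision_cols)  # 3 here; 134 in the dataset described above

# Strictness: how many provisions the state has in place
laws['strictness'] = laws[provision_cols].sum(axis=1)

# Laxness: how many provisions the state lacks
laws['laxness'] = n_provisions - laws['strictness']
print(laws[['state', 'strictness', 'laxness']])
```

Since laxness is a constant minus strictness, the two scores carry the same information; only the direction of the axis changes.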

In [None]:
df_gun_strictness = pd.read_csv('../datasets/gun_strictness.csv')

In [84]:
X_laxness = pd.DataFrame({
    'state': df_gun_strictness['state'],
    'laxness': np.zeros(len(df_gun_strictness['state'])),
    'case_density': np.zeros(len(df_gun_strictness['state']))
})
for i in range(len(X_laxness['state'])):
    state = X_laxness['state'][i]
    # Pick single values with .iloc[0]; assigning a one-element Series is deprecated
    X_laxness.at[i, 'laxness'] = 134 - df_gun_strictness[df_gun_strictness['state'] == state]['strictness'].iloc[0]
    state_row = X_orig[X_orig['state'] == state]
    X_laxness.at[i, 'case_density'] = state_row['count'].iloc[0] / state_row['percent_of_total'].iloc[0]

X_laxness['laxness'] = (X_laxness['laxness'] - X_laxness['laxness'].min()) / (X_laxness['laxness'].max() - X_laxness['laxness'].min())
X_laxness['case_density'] = (X_laxness['case_density'] - X_laxness['case_density'].min()) / (X_laxness['case_density'].max() - X_laxness['case_density'].min())


## Clustering over the laxness and case density

Next, we will try to cluster states by their (scaled) gun laxness and their case density.

In [15]:
from sklearn.cluster import KMeans

kmm_laxness = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmm_laxness.fit(X_laxness.drop('state', axis=1))

## Visualizing the clustering over laxness and case density results

In [16]:
clusters_laxness = kmm_laxness.predict(X_laxness.drop('state', axis=1))

fig = px.scatter(X_laxness, x='laxness', y='case_density', color=clusters_laxness, width=800, height=800, 
        labels={'laxness': 'Gun Laxness (Scaled)', 'case_density': 'Police Shooting Density (Scaled)'},
        title='K-Means Clustering of States by Gun Laxness and Police Shooting Density', hover_name='state')
fig.show()

We notice from the graph that states with higher gun laxness may or may not have a high case density, but states with lower gun laxness consistently have a lower case density. This suggests that stricter gun laws may act as a preventive measure against police shootings.

Moreover, in our A1 hypothesis, the points we had chosen for low case density (CT, MA, NJ, NY) are all present in `cluster = 1` (states with low case density as well as low gun laxness), and the points we had chosen for high case density (AZ, CO, NM, OK) are all present in `cluster = 2` (states with high case density as well as high gun laxness).

We extend the same hypothesis to the other states in the clusters `[1, 2]` and perform further analysis on those states.

# Analysis of states with low case density and low laxness against states with high case density and high laxness

In [17]:
X_laxness_filtered = X_laxness.copy()
X_laxness_filtered['cluster'] = clusters_laxness
# Keep only cluster 1 (low density, low laxness) and cluster 2 (high density, high laxness)
X_laxness_filtered = X_laxness_filtered[X_laxness_filtered['cluster'].isin([1, 2])]

In [70]:
import plotly.express as px

cscale = [[0, px.colors.sequential.Plasma[0]], [1, px.colors.sequential.Plasma[3]]]

fig = px.choropleth(X_laxness_filtered, locations='state', locationmode='USA-states', hover_name='state', 
        color='cluster', projection='albers usa', title='States with Low Case Density and Low Laxness vs. States With High Case Density and High Laxness', 
        color_continuous_scale=cscale, color_continuous_midpoint=1.5, width=1000, height=800)
fig.show()

The visualization above shows the states in the two clusters: low case density with low gun laxness, and high case density with high gun laxness. We will analyze these states' political alignment to see whether it corresponds to any trend in the clusters.

## Analysis by political alignment of the state

Political alignment refers to the state's overall tendency to vote **Democratic** (blue) or **Republican** (red) in presidential elections. We use a dataset that stores the alignment of each state in the form `2 - (no. of times the state has voted red in the last 4 elections)`, so that negative values indicate a Republican lean and positive values a Democratic lean. For example, if a state voted for the Republican party in 3 of the last 4 elections, its alignment will be `2 - 3 = -1`.
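
The worked example can be reproduced with a few lines of code. The sketch assumes the score subtracts the number of Republican wins from 2, as the `2 - 3 = -1` example does; the function name `alignment_score` and the `'R'`/`'D'` encoding are hypothetical.

```python
def alignment_score(winners):
    """Return 2 minus the number of Republican ('R') wins in the list.

    `winners` holds one entry per election, 'R' or 'D' (assumed encoding).
    """
    red_wins = sum(1 for party in winners if party == 'R')
    return 2 - red_wins

# A state that voted Republican in 3 of the last 4 elections
example = alignment_score(['R', 'R', 'D', 'R'])
print(example)
```

Under this convention the score ranges from `-2` (Republican in all four elections) to `2` (Democratic in all four), with `0` for an even split.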

In [None]:
df_state_alignments = pd.read_csv('../datasets/red_blue_states.csv')

In [79]:
import plotly.express as px

fig = px.choropleth(df_state_alignments, locations='state', locationmode='USA-states', hover_name='state', 
        color='alignment', projection='albers usa', title='Political Alignments by State', 
        color_continuous_scale=px.colors.diverging.RdBu, color_continuous_midpoint=0, width=1000, height=800)
fig.show()

To compare the states of clusters 1 and 2, we will plot the alignment of every state in each cluster and see how many lean red and how many lean blue.

In [None]:
X_laxness_filtered = X_laxness_filtered.merge(df_state_alignments[['state', 'alignment']], on='state')
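
Alongside a plot, the red and blue states in each cluster can also be counted directly. The sketch below uses made-up rows with hypothetical state codes and assumes that negative alignment scores indicate a Republican lean and positive scores a Democratic lean, as in the worked example above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the notebook's actual clusters
toy = pd.DataFrame({
    'state': ['AA', 'BB', 'CC', 'DD', 'EE', 'FF'],
    'cluster': [1, 1, 1, 2, 2, 2],
    'alignment': [2, 2, 1, -2, 0, -1],
})

# Classify each state by the sign of its alignment score
toy['lean'] = np.select(
    [toy['alignment'] > 0, toy['alignment'] < 0],
    ['blue', 'red'],
    default='even',
)

# Rows: cluster, columns: lean, values: number of states
lean_counts = pd.crosstab(toy['cluster'], toy['lean'])
print(lean_counts)
```

A table like this makes it easy to see whether one cluster is uniformly aligned while the other is mixed.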

In [80]:
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=1, subplot_titles=('Low Case Density and Low Laxness', 'High Case Density and High Laxness'))

fig.add_trace(go.Bar(y=X_laxness_filtered[X_laxness_filtered['cluster'] == 1]['state'],
    x=X_laxness_filtered[X_laxness_filtered['cluster'] == 1]['alignment'], name='Alignment', orientation='h',), row=1, col=1)

fig.add_trace(go.Bar(y=X_laxness_filtered[X_laxness_filtered['cluster'] == 2]['state'],
    x=X_laxness_filtered[X_laxness_filtered['cluster'] == 2]['alignment'], name='Alignment', orientation='h'), row=2, col=1)

fig.update_xaxes(range=[-2, 2])
fig.update_layout({'width': 600, 'height': 800})
fig.show()

States with a high case density and high gun laxness tend to lean towards the Republican party, although the pattern is not consistent across the whole cluster. On the other hand, states with a low case density and low gun laxness are politically aligned towards the Democratic party.