This is a preliminary exploration of the crime data. First I want to normalize the data to get a sense of what's going on, so I add up the categories for each city (violent + property + arson), then divide that sum by the city's population. The counts are stored as strings with comma separators, so I pass thousands=',' when importing the csv file to strip the commas from large numbers.
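
Here's a quick sketch of what that import option does, in case it's unfamiliar. The city names and numbers below are made up for illustration (they are not from the dataset), and the tiny in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; values are invented, not taken from the real file.
sample_csv = io.StringIO(
    'City,Population,Property crime\n'
    'Smallville,"12,345",678\n'
    'Bigtown,"1,234,567","45,678"\n'
)

# thousands=',' makes the parser drop the separators so the columns become numeric
sample = pd.read_csv(sample_csv, thousands=',')
print(sample.dtypes)
print(sample)
```

Without the thousands argument, the quoted columns would come back as strings and the sums later on would not work.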

In [None]:
# Import the libraries we need
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for graphing

df = pd.read_csv('../input/ca_offenses_by_city.csv', thousands=',')

In [None]:
# To start, I will look at the overall crime picture. I'll add them up first, then come back
# and look at the categories more closely
df = df[['City', 'Population', 'Violent crime', 'Property crime', 'Arson']]

# Let's take a look at a row
df.loc[2]

In [None]:
# This bit of code gets the crime columns into int data types, then sums them up
col_list= list(df)
col_list.remove('City')
col_list.remove('Population')

df[['Violent crime','Property crime', 'Arson']] = df[['Violent crime','Property crime', 'Arson']].astype(int)
df['All crimes'] = df[col_list].sum(axis=1)

# Let's look at a summary of the total crime numbers
df['All crimes'].describe()

In [None]:
# And population while we're at it
#df['Population'].describe()
#huge = df['Population'] > 100000
#df[huge].sort_values(by='Population')
df.loc[233]

In [None]:
# Let's create a new dataframe without Los Angeles
dfx = df.drop([233])
dfx['Population'].describe()

In [None]:
# Let's create a normalized "batting average" to compare cities by
# We'll call it 'Crime ratio'
df['Crime ratio'] = df['All crimes'] / df['Population']
df['Crime ratio'].describe()

5 crimes per person? Where is that, Liberty City? We'll need to look at that one. But before that, let's make a scatter plot of population against total crimes.

In [None]:
# Let's graph population vs. crimes
# We should see a positive correlation
x = df.loc[:, 'Population']
y = df.loc[:, 'All crimes']

plt.title("California Cities: Population vs. Total Crimes")
plt.xlabel("Population")
plt.ylabel("Total Crimes")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

That huge outlier makes the graph a little ugly. A little tweaking is in order. Good news for LA, though: its crime count relative to population is lower than the norm. Taking LA out of the set will act like zooming in on the data.

In [None]:
# Now let's graph population vs. crimes without Los Angeles (the big outlier)
x = dfx.loc[:, 'Population']
y = dfx.loc[:, 'All crimes']

plt.title("California Cities: Population vs. Total Crimes (without LA)")
plt.xlabel("Population")
plt.ylabel("Total Crimes")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

There we go. One big city there has got some 'splainin' to do. It's got about double the expected number of crimes for its population size. So, who is that? Let's get the cities with over 20,000 crimes and sort them by population.
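
To put a number on "expected" rather than eyeballing the plot, one option is to evaluate the best-fit line at each city's population and compare it to the actual count. This is only a sketch: the toy numbers are invented, and the 'Expected crimes' and 'Actual / expected' column names are my own, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Invented numbers for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Population': [100000, 200000, 300000, 400000],
    'All crimes': [3000, 6000, 18000, 12000],
})

# Same degree-1 fit used for the best-fit line on the plots
slope, intercept = np.polyfit(toy['Population'], toy['All crimes'], 1)

# Crimes the line predicts for each city's population
toy['Expected crimes'] = slope * toy['Population'] + intercept

# Values above 1 sit above the line, values below 1 sit under it
toy['Actual / expected'] = toy['All crimes'] / toy['Expected crimes']
print(toy.sort_values(by='Actual / expected', ascending=False))
```

Swapping toy for dfx would give the same comparison on the real cities, with the caveat that a straight line through all cities is a rough baseline at best.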

In [None]:
wildWest = dfx['All crimes'] > 20000
dfx[wildWest].sort_values(by='Population', ascending=False)

OK. The Streets of San Francisco. Lt. Mike Stone, where are you when they need you? Notice how high their property crimes are: 88% of their total crime count. They're also doing worse on violent crime for their size than San Diego and San Jose, which are both doing much better than average on total crime.

Let's look at that crime ratio number to see which cities come out on top and bottom.

In [None]:
low = df['Crime ratio'] < .01
df[low].sort_values(by='Crime ratio')

It looks like Imperial3 wins the prize for Sleepiest Town in the West. It's down by the Mexican border, between the Salton Sea and Mexicali.

In [None]:
high = df['Crime ratio'] > 0.05
df[high].sort_values(by='Crime ratio', ascending=False)

Industry and Vernon3 are weird little towns. They're basically industrial parks. A lot of violent crime in Industry per capita, though, huh? I imagine there's a huge influx of workers every day, and boy, do they get pissed off! Wikipedia to the rescue. Lots of strip clubs too. It is also home of the shopping mall where Doc Emmett Brown and Marty McFly first traveled to the future. Oh, they've got everything. Did I mention they've got a record crime ratio?

This is where it gets stickier. There are many factors that could account for a city's crime rate. I would recommend examining cities in population subsets for fairer comparisons, as well as in geographic subsets.
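
As a rough sketch of the population-subset idea, pd.cut can bin cities into size bands so that each city is compared only against its peers. The band edges, the labels, and the 'Size band' column are arbitrary choices of mine, and the numbers are invented.

```python
import pandas as pd

# Invented numbers for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Population': [5000, 8000, 60000, 90000, 250000, 900000],
    'Crime ratio': [0.01, 0.05, 0.02, 0.04, 0.03, 0.05],
})

# Arbitrary band edges (an assumption); each bin includes its right edge
bins = [0, 10000, 100000, float('inf')]
labels = ['small', 'medium', 'large']
toy['Size band'] = pd.cut(toy['Population'], bins=bins, labels=labels)

# Median crime ratio within each band gives a fairer yardstick than one global number
band_summary = toy.groupby('Size band', observed=True)['Crime ratio'].median()
print(band_summary)
```

The same grouping would work for geographic subsets if a county or region column were joined in, which this dataset does not appear to include.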

Now let's move on to the types of crime: violent, property, and arson.

In [None]:
# Let's graph population vs. violent crimes
# We should see a positive correlation
x = df.loc[:, 'Population']
y = df.loc[:, 'Violent crime']

plt.title("California Cities: Population vs. Violent Crime")
plt.xlabel("Population")
plt.ylabel("Violent Crime")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

Los Angeles doesn't fare so well now. Again, let's take a look without them in the dataset.

In [None]:
# Let's graph population vs. violent crimes without LA
# We should see a positive correlation
x = dfx.loc[:, 'Population']
y = dfx.loc[:, 'Violent crime']

plt.title("California Cities: Population vs. Violent Crime (without LA)")
plt.xlabel("Population")
plt.ylabel("Violent Crime")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

The next 3 big cities behind LA (San Diego, San José, and San Francisco) show the same pattern we saw in the total crimes graph. But there's a new city with worse violent crime stats per population than San Francisco: Oakland. It's the dot below the second "p" in "population" in the graph above.

Let's set up a new column. We'll call it 'Violent crime ratio'. But this time we'll just look at the ratio of violent crimes to all crimes.

In [None]:
# Violent crime ratio
df['Violent crime ratio'] = df['Violent crime'] / df['All crimes']
df['Violent crime ratio'].describe()

Again, let's look at the top and the bottom of this new statistic.

In [None]:
low = df['Violent crime ratio'] < 0.025
df[low].sort_values(by='Violent crime ratio')

In [None]:
high = df['Violent crime ratio'] > 0.19
df[high].sort_values(by='Violent crime ratio', ascending=False)

Oh, the places you'll go! Reading up on Willits on Wikipedia got me thinking about the correlation between lead poisoning and crime. Here's a link to a good summary: *Lead: America's Real Criminal Element* http://www.motherjones.com/environment/2016/02/lead-exposure-gasoline-crime-increase-children-health

It turns out Willits has a history of environmental pollution lawsuits concerning a chromium plating plant in the town. Although the lead story deals with leaded gasoline, and Willits has or had a chromium problem, it could be worth exploring if there's any connection between Willits' violent crime stats and chromium exposure. Chromium does come up in the literature when I search on neurological effects of toxic metals. I'd recommend looking at the town's historical crime stats first, to see if this set of stats for Willits is unique.

Now for property crime.

In [None]:
# Let's graph population vs. property crime without LA
x = dfx.loc[:, 'Population']
y = dfx.loc[:, 'Property crime']

plt.title("California Cities: Population vs. Property Crime (without LA)")
plt.xlabel("Population")
plt.ylabel("Property Crime")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

As we have been seeing, San Francisco is having a rough time with property crime.

And now for Arson:

In [None]:
# Let's graph population vs. Arson without LA
x = dfx.loc[:, 'Population']
y = dfx.loc[:, 'Arson']

plt.title("California Cities: Population vs. Arson (without LA)")
plt.xlabel("Population")
plt.ylabel("Arson")
plt.scatter(x, y)
# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

So which city is that point directly below the "s" in "Cities"?

In [None]:
high = dfx['Arson'] > 300
dfx[high].sort_values(by='Arson', ascending=False)

Apparently the *Bakersfield Sound* is that of a building on fire.

Let's look at the violent crimes in more detail. I'll set up a new data frame for it from the csv file.

In [None]:
dfv = pd.read_csv('../input/ca_offenses_by_city.csv', thousands=',')
dfv.describe()

In [None]:
# Keep only the violent crime columns, then shorten their names
dfv = dfv[['City','Population','Violent crime', 'Murder and nonnegligent manslaughter', 'Rape (revised definition)','Robbery','Aggravated assault']]
dfv.columns = ['City','Population','Violent crime','Murder','Rape','Robbery','Assault']
dfv.describe()

In [None]:
plt.figure(0)
plt.subplot(2,2,1)
x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Murder']
#plt.title('California Cities: Pop. vs. Murder')
#plt.xlabel('Population')
plt.ylabel('Murder')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))

plt.subplot(2,2,2)
#x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Rape']
#plt.title('California Cities: Pop. vs. Rape')
#plt.xlabel('Population')
plt.ylabel('Rape')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))

plt.subplot(2,2,3)
#x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Robbery']
#plt.title('California Cities: Pop. vs. Robbery')
#plt.xlabel('Population')
plt.ylabel('Robbery')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))

plt.subplot(2,2,4)
#x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Assault']
#plt.title('California Cities: Pop. vs. Assault')
#plt.xlabel('Population')
plt.ylabel('Assault')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))

plt.show()

A quick look at all 4 categories tells us that Los Angeles is a little worse than average in all of them. Let's remove LA from the data set and graph the categories separately.

In [None]:
dfv = dfv.drop([233])

In [None]:
x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Murder']
plt.title('California Cities: Population vs. Murder (w/o LA)')
plt.xlabel('Population')
plt.ylabel('Murder')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

Which city is that with 80+ murders? Let's find out.

In [None]:
high = dfv['Murder'] > 40
dfv[high].sort_values(by='Murder', ascending=False)

In [None]:
# Let's look at the murder ratio:
dfv['Murder ratio'] = dfv['Murder'] / dfv['Population']
dfv['Murder ratio'].describe()

In [None]:
high = dfv['Murder ratio'] > 0.0002
dfv[high].sort_values(by='Murder ratio', ascending=False)

Salinas and San Bernardino are two cities with populations > 100,000 that have murder ratios equal to or worse than Oakland's. 
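
The cell above filters on the ratio only, so small towns are mixed in with the big cities. Here is a sketch of how the population cutoff could be applied explicitly, with the ratio rescaled to the more familiar "per 100,000 residents" form. The numbers are invented and the 'Murder per 100k' column name is my own.

```python
import pandas as pd

# Invented numbers for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Population': [50000, 150000, 400000],
    'Murder': [10, 30, 40],
})

# Same ratio as before, scaled so the values are easier to read
toy['Murder per 100k'] = toy['Murder'] / toy['Population'] * 100000

# Keep only cities with more than 100,000 residents, then rank them
big = toy[toy['Population'] > 100000]
ranked = big.sort_values(by='Murder per 100k', ascending=False)
print(ranked)
```

Running the same two steps on dfv would list just the large cities, ordered by murder rate.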

Moving on. Let's look at the rape statistics:

In [None]:
x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Rape']
plt.title('California Cities: Population vs. Rape (w/o LA)')
plt.xlabel('Population')
plt.ylabel('Rape')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

At first glance, Oakland stands out, along with San Diego because of its sheer number of rapes, but let's look at the rape ratio.

In [None]:
# Rape ratio
dfv['Rape ratio'] = dfv['Rape'] / dfv['Population']
dfv['Rape ratio'].describe()



In [None]:
high = dfv['Rape ratio'] > 0.0008
dfv[high].sort_values(by='Rape ratio', ascending=False)

Visalia and Vallejo have the highest rape ratios for cities with populations > 100,000.

Now for robbery:

In [None]:
x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Robbery']
plt.title('California Cities: Population vs. Robbery (w/o LA)')
plt.xlabel('Population')
plt.ylabel('Robbery')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

Those two high robbery cities stand out:

In [None]:
high = dfv['Robbery'] > 1000
dfv[high].sort_values(by='Robbery', ascending=False)

San Francisco and Oakland. On the other side of the best-fit line, San Diego and San Jose are doing well.

And assaults:

In [None]:
x = dfv.loc[:, 'Population']
y = dfv.loc[:, 'Assault']
plt.title('California Cities: Population vs. Assault (w/o LA)')
plt.xlabel('Population')
plt.ylabel('Assault')
plt.scatter(x, y)
plt.grid()

# This next bit of code places a best-fit line on the plot
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.show()

In [None]:
high = dfv['Assault'] > 1000
dfv[high].sort_values(by='Assault', ascending=False)

Stockton is an odd one. Its violent crime stats are almost in line with San Diego's, a city over four times as large, but its rape figures are in line with other cities of comparable size.