In [236]:
import pandas as pd
import numpy as np
import sklearn.ensemble as sk
%pylab inline

df = pd.read_csv('../datasets/CommvsCrime.csv')

# remove the ':' from the column names
df.columns = [col.replace(':', '') for col in df.columns]

df = df.replace('?', np.nan)  # replace ? with NaN to get a proper count
df.count()

Populating the interactive namespace from numpy and matplotlib


communityname            2215
state                    2215
countyCode                994
communityCode             991
fold                     2215
population               2215
householdsize            2215
racepctblack             2215
racePctWhite             2215
racePctAsian             2215
racePctHisp              2215
agePct12t21              2215
agePct12t29              2215
agePct16t24              2215
agePct65up               2215
numbUrban                2215
pctUrban                 2215
medIncome                2215
pctWWage                 2215
pctWFarmSelf             2215
pctWInvInc               2215
pctWSocSec               2215
pctWPubAsst              2215
pctWRetire               2215
medFamInc                2215
perCapInc                2215
whitePerCap              2215
blackPerCap              2215
indianPerCap             2215
AsianPerCap              2215
OtherPerCap              2214
HispPerCap               2215
NumUnderPov              2215
PctPopUnde

Since 90% of the cities have the ViolentCrimesPerPop and nonViolPerPop attributes, I will use the sum of these two attributes as the measure of total crime in a community, and I will only select cities that have both. I will also drop the county and community code columns, since they aren't very important and are missing a lot of information.

In [237]:
# keep only the rows where both crime rates are present
df.dropna(subset=['ViolentCrimesPerPop', 'nonViolPerPop'], inplace=True)
# convert the columns that were read as strings into numbers
df = df.convert_objects(convert_numeric=True)
df['TotalCrime'] = df.nonViolPerPop.add(df.ViolentCrimesPerPop, axis=0)

# drop every column that has fewer non-missing values than the crime columns
for col in df.columns:
    if df[col].count() < df.nonViolPerPop.count():
        df.drop(col, axis=1, inplace=True)
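
`convert_objects` is no longer available in recent pandas releases, so the cell above only runs on an old install. A minimal sketch of the same idea with `pd.to_numeric`, on a made-up three-row frame (the values and the `text_cols` list are assumptions for illustration, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Toy frame with the same quirks as the raw file (made-up values)
raw = pd.DataFrame({
    "communityname": ["Acity", "Btown", "Cville"],
    "population": ["11980", "23123", "29344"],
    "countyCode": ["39", "?", "?"],
    "ViolentCrimesPerPop": ["41.02", "127.56", "?"],
}).replace("?", np.nan)

# Columns assumed to hold text; they are left untouched
text_cols = ["communityname"]

# Convert every other column to numbers; anything unparseable becomes NaN
for col in raw.columns.difference(text_cols):
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

print(raw.dtypes)
```

Listing the text columns explicitly keeps names such as `communityname` from being turned into NaN, which matters here because the later `iloc` slice depends on those columns staying in place.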

I learned that when testing a hypothesis I should split the data into a training group, used to fit the model, and a test group, used to evaluate how effective the model is. I also realize I don't really have a hypothesis yet, so it looks like I don't know what I am doing.

### Hypothesis
So I guess my hypothesis is that the percentage of minorities in a community is strongly correlated with the ratio of crime to population.

In [238]:
# code from stackoverflow to separate the test and training groups (roughly 20% / 80%)
msk = np.random.rand(len(df)) < 0.8
train_data = df[msk]
test_data = df[~msk]
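
Because the mask is drawn from an unseeded random generator, the split changes on every run. A minimal sketch of a repeatable version of the same mask, on a toy frame (the frame, the seed value, and the `_toy` names are placeholders of mine):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned crime data (values are made up)
toy = pd.DataFrame({"x": np.arange(1000), "TotalCrime": np.arange(1000) * 2.0})

# Fixed seed: the same rows land in the same group on every run
rng = np.random.default_rng(42)
mask = rng.random(len(toy)) < 0.8

train_toy = toy[mask]
test_toy = toy[~mask]

# The two groups are disjoint and together cover every row
print(len(train_toy), len(test_toy))
```

The mask gives roughly, not exactly, an 80/20 split. scikit-learn's `train_test_split` is another option if exact proportions matter.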

Now that I've cleaned up my data, I can start processing it.

In [239]:
# select only the predictors (independent variables) and the goal (dependent variable)
predict = train_data.iloc[:, 5:104]
totalCrime = train_data.TotalCrime

# run random forest
rfc = sk.RandomForestClassifier(n_estimators=1, oob_score=True)  # n_estimators originally set at 50
model = rfc.fit(predict, totalCrime)

print(rfc.oob_score_)





0.000651890482399


### My Mistakes
So the out-of-bag score was essentially 0 (about 0.0007), and for the longest time I couldn't understand why. It wasn't until I read exactly what the out-of-bag score is and reviewed random forest classification that I came to understand it. The reason is that there are no classes in the data: TotalCrime is a continuous rate, so the classifier treats nearly every distinct value as its own class and almost never predicts the exact value of a community it hasn't seen. I need to either use another form of analysis (like multivariable linear regression) or classify the data by hand, and I've decided to do the latter.
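
A small sketch of the problem and of the fix, using made-up crime rates rather than the real data (the gamma distribution and the median cutoff are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical continuous crime rates for 1,000 communities (made-up values)
total_crime = rng.gamma(shape=2.0, scale=3000.0, size=1000)

# Treated as class labels, almost every value is its own class
n_classes = len(np.unique(total_crime))
print(n_classes)

# Thresholding turns the same values into a two-class target
threshold = np.median(total_crime)  # illustrative cutoff only
dangerous_toy = (total_crime > threshold).astype(int)
print(np.bincount(dangerous_toy))
```

With about as many classes as rows there is nothing for a classifier to generalize from, while the thresholded target has two well-populated classes.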

### Research
What I've realized is that I am interested not just in the pure crime rate but in determining what makes a city have a bad crime rate. So my first order of business is to determine what a bad crime rate is. In the United States there is no formal classification of cities based on their crime rate, which means I will have to create my own definition. The ghetto is one of the worst places to live in the United States because these areas have high crime rates. Most U.S. citizens would agree that a city's crime rate is too high if it is similar to that of a ghetto. So if I find the average crime rate of ghettos, then I can use that information as the basis for my classification.

#### Defining Ghetto
It turns out the United States does not have any official measurement for what constitutes a ghetto; in fact, the U.S. doesn't even give an official definition of the term. Luckily there is a classification officially defined by the U.S. Census Bureau called **Concentrated Poverty** that does a good job of encapsulating what the ghetto is. The definition given by the bureau is an area where at least 40 percent of the tract population lives below the federal poverty threshold [1]. While it is argued that the 40 percent cutoff is arbitrary, scholars recognize that the cutoff corresponded closely with neighborhoods that city officials and local Census Bureau officials considered ghettos [2].

So we will now classify each community based on how its crime rate relates to the crime rate of the average area that has concentrated poverty (effectively making it a ghetto).

### Getting the data
Getting U.S. census data was not fun, and was probably a lot more work than it was worth. The data on poverty in the U.S. is very granular and goes all the way down to U.S. census tracts, but the best crime data I could find for the same time frame only goes down to the city level, for cities with populations above 25,000. In hindsight I probably could've used the crime data from UCI that we are analyzing now, but oh well. So I got the poverty levels for the large cities, but to make up for the change in size (from U.S. census tract to U.S. city) I lowered the cutoff and selected areas with poverty levels above 20%. Areas with poverty levels above 20% are still above the national average, and are just as concerning since cities are much larger than tracts.

After messing around in Excel I was able to clean up the data and get the poverty rate and crime rate (both violent and nonviolent) of 1172 cities across all 50 states. So let's take a look at the csv I made.





In [240]:
censusdf = pd.read_csv('../datasets/CityCrimePoverty.csv')
censusdf


| | City | Violent | Nonviolent | Poverty-Families | Poverty-Individuals |
|---|---|---|---|---|---|
| 0 | Abilene city, Texas | 356.301869 | 3874.341856 | 10.9 | 15.4 |
| 1 | Agawam city, Massachusetts | 736.205263 | 1858.736059 | 4.3 | 5.6 |
| 2 | Aiken city, South Carolina | 0.000000 | 0.000000 | 10.1 | 14.4 |
| 3 | Akron city, Ohio | 0.000000 | 0.000000 | 14.0 | 17.5 |
| 4 | Alameda city, California | 391.141044 | 3280.663127 | 6.0 | 8.2 |
| 5 | Alamogordo city, New Mexico | 313.089146 | 4026.393752 | 13.2 | 16.5 |
| 6 | Albany city, Georgia | 679.104384 | 6405.132125 | 21.5 | 27.1 |
| 7 | Albany city, New York | 1136.479566 | 6098.630920 | 16.0 | 21.7 |
| 8 | Albany city, Oregon | 219.978002 | 6724.327567 | 9.3 | 11.6 |
| 9 | Albuquerque city, New Mexico | 1168.005385 | 7802.849060 | 10.0 | 13.5 |


In [241]:
# The data still has cities with poverty levels below 20%, so I'll filter them out.

# The copy() at the end is to prevent the SettingWithCopy warning
censusdf = censusdf[(censusdf['Poverty-Individuals'] > 20) | (censusdf['Poverty-Families'] > 20)].copy()
censusdf['totalCrime'] = censusdf.Violent.add(censusdf.Nonviolent)
censusdf.describe()


| | Violent | Nonviolent | Poverty-Families | Poverty-Individuals | totalCrime |
|---|---|---|---|---|---|
| count | 207.0 | 207.0 | 207.0 | 207.0 | 207.0 |
| mean | 821.369656 | 5129.38141 | 18.705314 | 25.196618 | 5950.751066 |
| std | 615.846368 | 2962.148459 | 5.048668 | 4.612671 | 3369.685543 |
| min | 0.0 | 0.0 | 4.1 | 20.1 | 0.0 |
| 25% | 353.434064 | 3370.365727 | 16.45 | 21.9 | 4057.848046 |
| 50% | 737.774695 | 4898.550884 | 18.5 | 23.9 | 6126.701129 |
| 75% | 1159.697535 | 6842.648862 | 21.65 | 27.0 | 7677.215451 |
| max | 2743.057693 | 16767.18938 | 32.8 | 46.9 | 18127.454247 |
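
The `min` row shows cities whose total crime is 0. For places like Aiken and Akron in the listing further up, that may be missing data stored as zero rather than a true zero. This is only my assumption; nothing in the file confirms it. A sketch with made-up totals of how much such rows can pull a median down:

```python
import numpy as np
import pandas as pd

# Made-up city totals; the two zeros stand in for rows that may be unreported
toy_total = pd.Series([0.0, 0.0, 4230.6, 3671.8, 7084.2, 7235.1])

median_all = toy_total.quantile(0.5)

# Assumption: a zero means "not reported", so mask it before summarizing
median_reported = toy_total.replace(0.0, np.nan).quantile(0.5)

print(median_all, median_reported)
```

If the zeros in the real file are indeed placeholders, the threshold derived from `censusdf.totalCrime` would shift upward once they are masked.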


In [242]:
# The mean total crime for the "ghettos" is about 5,950 cases per 100,000 people,
# and the median is about 6,127.
# Now we will classify which communities in our original dataset have crime rates
# higher than half of the poverty cities, i.e. above that median.
dangerous = totalCrime > censusdf.totalCrime.quantile(.5)
# convert Booleans to integers
dangerous = dangerous.astype(int)

# Now let's give it another go
# run random forest
rfc = sk.RandomForestClassifier(n_estimators=500, oob_score=True)
model = rfc.fit(predict, dangerous)

print(rfc.oob_score_)

0.842242503259


Wow, much better! The performance has improved a lot too. Now I can take a look at the importance of each feature.

In [243]:
# Load the importances into a dataframe, one row per feature
featureDF = pd.DataFrame({'importance': rfc.feature_importances_,
                          'feature': list(predict.columns)},
                         columns=['importance', 'feature'])
# sort_values replaces the older DataFrame.sort
featureDF.sort_values('importance', ascending=False, inplace=True)

# Let's see the top 10 importances
featureDF[0:10]



    


| | importance | feature |
|---|---|---|
| 41 | 0.057528 | PctKids2Par |
| 40 | 0.049853 | PctFam2Par |
| 42 | 0.039762 | PctYoungKids2Par |
| 47 | 0.035245 | PctKidsBornNeverMar |
| 43 | 0.034826 | PctTeen2Par |
| 38 | 0.033933 | TotalPctDiv |
| 37 | 0.033917 | FemalePctDiv |
| 24 | 0.031379 | NumUnderPov |
| 46 | 0.031183 | NumKidsBornNeverMar |
| 25 | 0.02597 | PctPopUnderPov |


#### Features
This is interesting: it looks like parenting plays the largest role in determining whether an area is as dangerous as the ghetto. 8 out of the top 10 features have to do with parenting and family structure (two-parent households, divorce, and kids born to never-married parents); the other two measure poverty. This is really insightful. I had thought that having racially segregated areas played the largest role in determining the crime rate, but it looks like this can be overcome with a strong household.
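
To put numbers on that observation, the importances in the top-10 table can be summed by theme. The importances below are copied from the table; the theme labels are my own grouping, not something the dataset provides:

```python
import pandas as pd

# Top-10 importances copied from the table above
top10 = pd.DataFrame({
    "feature": ["PctKids2Par", "PctFam2Par", "PctYoungKids2Par",
                "PctKidsBornNeverMar", "PctTeen2Par", "TotalPctDiv",
                "FemalePctDiv", "NumUnderPov", "NumKidsBornNeverMar",
                "PctPopUnderPov"],
    "importance": [0.057528, 0.049853, 0.039762, 0.035245, 0.034826,
                   0.033933, 0.033917, 0.031379, 0.031183, 0.025970],
})

# Hypothetical grouping: poverty measures vs. everything family-related
poverty_features = {"NumUnderPov", "PctPopUnderPov"}
top10["theme"] = top10["feature"].apply(
    lambda name: "poverty" if name in poverty_features else "family structure"
)

by_theme = top10.groupby("theme")["importance"].sum()
print(by_theme)
print(top10["importance"].sum())
```

The family-structure features add up to roughly 0.32 and the two poverty features to roughly 0.06, so the top 10 together carry about 0.37 of the total importance.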

One thing I've also noticed is that I included raw magnitudes (counts such as NumUnderPov and NumKidsBornNeverMar) in the data, which skews things since they aren't adjusted for population differences. This is an area for improvement.
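
One way to make that adjustment is to express each raw count as a share of the population. A minimal sketch on made-up communities (the numbers and the `PctUnderPov_derived` column name are placeholders of mine):

```python
import pandas as pd

# Made-up communities; column names follow the dataset's naming
toy = pd.DataFrame({
    "population": [10000, 250000, 40000],
    "NumUnderPov": [1500, 20000, 8000],
})

# Express the raw count as a percentage of the population
toy["PctUnderPov_derived"] = 100.0 * toy["NumUnderPov"] / toy["population"]
print(toy)
```

For a count that already has a percentage counterpart in the data, such as NumUnderPov and PctPopUnderPov, simply leaving the raw count out of the predictors would have the same effect.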

Now it's time to evaluate the model.

In [244]:
# Get the variables from the test data

# Independent variables
test_ind = test_data.iloc[:, 5:104]
# Dependent variable, encoded as 0/1 like the training labels
test_danger = (test_data.TotalCrime > censusdf.totalCrime.quantile(.5)).astype(int)

print("mean accuracy score for test set = %f" % rfc.score(test_ind, test_danger))

mean accuracy score for test set = 0.839674
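
An accuracy figure is easier to judge next to a simple baseline: how often would a model be right if it always predicted the more common class? A sketch with made-up labels and predictions, not the notebook's actual test set:

```python
import numpy as np

# Made-up true labels and predictions (1 = "dangerous", 0 = not)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Fraction of predictions that match the true label
accuracy = np.mean(y_true == y_pred)

# Baseline: always predict whichever class is more common
majority_class = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority_class)

print(accuracy, baseline)
```

Running the same comparison on `test_danger` would show how much of the test accuracy comes from the model rather than from class imbalance.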


## Results
Despite all the mistakes made, it looks like our model is pretty solid: each time this notebook is run the data is reshuffled, and the accuracy of the model is consistently around 84%.

### References
[1] Kneebone, Elizabeth. "The Growth and Spread of Concentrated Poverty, 2000 to 2008-2012." The Brookings Institution. N.p., 31 July 2014. Web. 25 May 2015.

[2] Jargowsky, P. and Bane, M. 1991. "Ghetto Poverty in the United States, 1970 to 1980." In The Urban Underclass, edited by Christopher Jencks and Paul E. Peterson. Washington, D.C.: The Brookings Institution.
