# Background and business problem


In this project I analyze Baltimore crime data to explore crime rates and the relative safety of Baltimore neighborhoods for home buyers and real estate developers. Buying a home is one of the most important expenditures a family incurs, often once in a lifetime. Many factors affect a household's decision to purchase a home. Of these, the safety of the neighborhood is one of the most important, and it deserves thorough research before committing to a purchase. Although the primary audience is home buyers, real estate developers should also be interested in the findings, because neighborhood safety directly affects the housing market and the profitability of real estate.
According to a NeighborhoodScout report, Baltimore has a crime rate of 71 per 1,000 residents, one of the highest in America among communities of all sizes, from the smallest towns to the very largest cities [1]. Crime risk can nevertheless vary significantly across neighborhoods within the same city. Clustering and segmenting neighborhoods by crime rate is therefore useful for anyone planning to buy a home in this vibrant, historic city. In this project I explore the distribution of total crimes across neighborhoods and use k-means clustering to identify clusters of the safest neighborhoods based on crime rates. Once the safest neighborhoods have been determined, I use Foursquare location data to explore nearby venues and amenities that might also influence the decision to buy a home, such as schools, hospitals, transportation and shopping centers.
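
As a rough illustration of the clustering step, the sketch below runs scikit-learn's `KMeans` (the class the notebook imports later) on a handful of made-up crime rates and treats the cluster with the lowest mean rate as the safest. The values, the choice of three clusters, and the variable names are assumptions for demonstration; they are not results from the Baltimore data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up crime rates for nine hypothetical neighborhoods
toy = pd.DataFrame(
    {
        "Neighborhood": list("ABCDEFGHI"),
        "CrimeRate": [1.0, 1.2, 0.9, 4.1, 4.4, 3.9, 9.8, 10.2, 10.0],
    }
)

# KMeans expects a 2-D feature array, hence the double brackets
X = toy[["CrimeRate"]].to_numpy()
model = KMeans(n_clusters=3, n_init=10, random_state=0)  # 3 clusters is an assumption
toy["Cluster"] = model.fit_predict(X)

# Cluster labels are arbitrary, so find the safest one by its mean rate
cluster_means = toy.groupby("Cluster")["CrimeRate"].mean()
safest_cluster = cluster_means.idxmin()
safest = toy.loc[toy["Cluster"] == safest_cluster, "Neighborhood"].tolist()
print(safest)
```

Because k-means assigns arbitrary integer labels, ranking clusters by their mean crime rate is what makes "safest" well defined.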

Reference:
1.	https://www.neighborhoodscout.com/blog/top100dangerous

# Data Source and description

For this project I use victim-based crime data compiled by the Baltimore Police Department (BPD) and published for public use through the Open Baltimore project. Each record is geocoded to the approximate latitude/longitude of the incident and does not necessarily reflect the exact location. The data author warns that any attempt to match the approximate location of an incident to an exact address is strictly prohibited. The original data set is stored in CSV format and comprises 255,010 rows and 16 variables. It covers incidents reported from January 2014 to April 2019, when I last downloaded it from the Open Baltimore website. According to the author's description, the data is a victim-based preliminary report that is subject to change. I therefore discourage any use of these findings for public consumption. I used the data only for training purposes, to demonstrate analytic skills for the data science specialization course capstone.

# Data cleaning and preparation

Data cleaning is one of the most important tasks and takes a large amount of time in data science projects. The first step in the data cleaning process is importing the required libraries and loading the dataset from an external data file. For this analysis, I imported pandas and other Python libraries and read the CSV data file into a pandas DataFrame using the following code.


In [3]:
import pandas as pd
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs
from sklearn import preprocessing
%matplotlib inline

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
# Fetch the file
my_file = project.get_file("BPD_Data.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
myData=pd.read_csv(my_file)
myData.head()

Unnamed: 0,CrimeDate,CrimeTime,CrimeCode,Location,Description,Inside/Outside,Weapon,Post,District,Neighborhood,Longitude,Latitude,Location 1,Premise,vri_name1,Total Incidents
0,4/27/2019,9:23:00 AM,4E,500 N CHARLES ST,COMMON ASSAULT,,,142.0,CENTRAL,Mount Vernon,-76.61555,39.29518,,,,1
1,4/27/2019,9:23:00 AM,4E,500 N CHARLES ST,COMMON ASSAULT,I,,142.0,CENTRAL,Mount Vernon,-76.61555,39.29518,,OTHER - INSIDE,,1
2,4/27/2019,9:15:00 AM,4C,3300 BARCLAY ST,AGG. ASSAULT,O,OTHER,512.0,NORTHERN,Oakenshawe,-76.61105,39.3291,,STREET,,1
3,4/27/2019,9:15:00 AM,5A,3300 SAINT AMBROSE AVE,BURGLARY,,,614.0,NORTHWESTERN,Central Park Heights,-76.67051,39.34205,,,,1
4,4/27/2019,9:12:00 AM,4E,3900 WALNUT AVE,COMMON ASSAULT,I,,424.0,NORTHEASTERN,Overlea,-76.53204,39.36015,,ROW/TOWNHOUSE-OCC,,1


In the next step I inspected the dataset, including its dimensions, variables and types, using built-in pandas attributes. Since I am not going to use all the variables for this analysis, I selected the relevant ones and generated a subset of the data using the following code.

In [6]:
myData.shape

(255010, 16)
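
Beyond `shape`, column types and missing-value counts can be inspected with standard pandas calls. The sketch below is a minimal example on a small frame named `sample`, an illustrative stand-in built from three of the preview rows above rather than the full BPD file.

```python
import numpy as np
import pandas as pd

# Stand-in for a few columns of the BPD data (rows copied from the preview)
sample = pd.DataFrame(
    {
        "Description": ["COMMON ASSAULT", "AGG. ASSAULT", "BURGLARY"],
        "Weapon": [np.nan, "OTHER", np.nan],
        "Neighborhood": ["Mount Vernon", "Oakenshawe", "Central Park Heights"],
        "Longitude": [-76.61555, -76.61105, -76.67051],
        "Latitude": [39.29518, 39.32910, 39.34205],
        "Total Incidents": [1, 1, 1],
    }
)

n_rows, n_cols = sample.shape            # dimensions of the frame
column_types = sample.dtypes             # data type of each column
missing_counts = sample.isnull().sum()   # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_counts)
```

Columns with many missing values, such as `Weapon` in the preview, are natural candidates to leave out of the subset.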

In [7]:
New_Data=myData[['Description','District', 'Neighborhood','Longitude', 'Latitude','Total Incidents']]
New_Data.head()

Unnamed: 0,Description,District,Neighborhood,Longitude,Latitude,Total Incidents
0,COMMON ASSAULT,CENTRAL,Mount Vernon,-76.61555,39.29518,1
1,COMMON ASSAULT,CENTRAL,Mount Vernon,-76.61555,39.29518,1
2,AGG. ASSAULT,NORTHERN,Oakenshawe,-76.61105,39.3291,1
3,BURGLARY,NORTHWESTERN,Central Park Heights,-76.67051,39.34205,1
4,COMMON ASSAULT,NORTHEASTERN,Overlea,-76.53204,39.36015,1


As part of data preparation, I grouped the data frame by neighborhood and aggregated the total number of incidents for each group. A close examination of the location information showed only slight variation in coordinates among incidents within a neighborhood. I therefore used the first longitude and latitude recorded for each neighborhood when aggregating.

In [8]:
# Group the data frame by neighborhood and aggregate the total number of incidents for each group
Crime_data1=New_Data.groupby(
    ['Neighborhood']
).agg(
    {
         'Total Incidents': 'sum',    # sum of incidents per neighborhood
         'District': 'first',  # first district value
         'Longitude': 'first',  # first longitude value
         'Latitude': 'first'  # first latitude value
    }
)
Crime_data2=Crime_data1.reset_index()
Crime_data2.head()

Unnamed: 0,Neighborhood,Total Incidents,District,Longitude,Latitude
0,Abell,654,NORTHERN,-76.61091,39.32821
1,Allendale,1180,SOUTHWESTERN,-76.68061,39.29281
2,Arcadia,267,NORTHEASTERN,-76.57346,39.33873
3,Arlington,1089,NORTHWESTERN,-76.68454,39.35247
4,Armistead Gardens,793,NORTHEASTERN,-76.56099,39.30815
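
Taking the first coordinate pair works when the incidents in a neighborhood sit close together. For comparison, the sketch below uses pandas named aggregation to sum incidents and average the coordinates instead, which gives an approximate centroid per neighborhood. The frame `incidents`, its coordinate values, and the output column names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny made-up incident table with the same column names as the subset above
incidents = pd.DataFrame(
    {
        "Neighborhood": ["Abell", "Abell", "Arcadia"],
        "District": ["NORTHERN", "NORTHERN", "NORTHEASTERN"],
        "Longitude": [-76.610, -76.612, -76.573],
        "Latitude": [39.328, 39.330, 39.338],
        "Total Incidents": [1, 1, 1],
    }
)

by_neighborhood = (
    incidents.groupby("Neighborhood", as_index=False)
    .agg(
        total_incidents=("Total Incidents", "sum"),  # incidents per neighborhood
        district=("District", "first"),              # first district value
        longitude=("Longitude", "mean"),             # mean longitude (centroid)
        latitude=("Latitude", "mean"),               # mean latitude (centroid)
    )
)
print(by_neighborhood)
```

The mean is less sensitive than `first` to whichever incident happens to appear at the top of the file, at the cost of producing a point that may not coincide with any actual incident.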


Finally, I calculated a crime rate for each neighborhood. Because the dataset contains no population counts, the `CrimeRate` column computed below is each neighborhood's share of all recorded incidents, expressed per 1,000 incidents, rather than a rate per 1,000 residents. The final dataset has 283 records, one per neighborhood, and the 6 variables needed for the analysis. I computed descriptive statistics to explore the data and check for anomalies, outliers and missing observations.
For this analysis I segment the neighborhoods based on their crime rates and identify the safest ones. I also use Foursquare data to explore venues such as shopping centers, schools, hospitals and other amenities within the best neighborhoods identified.
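
To make the difference between an incident share and a per-resident rate concrete, the sketch below computes both on made-up numbers. A per-resident rate needs population counts; the `Population` column and all values here are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C"],
        "Total Incidents": [600, 300, 100],
        "Population": [4000, 6000, 1000],  # hypothetical resident counts
    }
)

# Share of all incidents, scaled to "per 1,000 incidents"
toy["IncidentShare"] = 1000 * toy["Total Incidents"] / toy["Total Incidents"].sum()

# Incidents per 1,000 residents (requires population data)
toy["RatePer1000Residents"] = 1000 * toy["Total Incidents"] / toy["Population"]

print(toy.round(2))
```

In this toy example neighborhood C has the smallest incident share but a higher per-resident rate than B, so the two measures can rank neighborhoods differently.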


In [13]:
Crime_data2['CrimeRate']= 1000*(Crime_data2['Total Incidents']/Crime_data2['Total Incidents'].sum())
Crime_data3=Crime_data2.round({"CrimeRate":2})
print(Crime_data3.shape)
Crime_data3.head()

(283, 6)


Unnamed: 0,Neighborhood,Total Incidents,District,Longitude,Latitude,CrimeRate
0,Abell,654,NORTHERN,-76.61091,39.32821,2.59
1,Allendale,1180,SOUTHWESTERN,-76.68061,39.29281,4.68
2,Arcadia,267,NORTHEASTERN,-76.57346,39.33873,1.06
3,Arlington,1089,NORTHWESTERN,-76.68454,39.35247,4.32
4,Armistead Gardens,793,NORTHEASTERN,-76.56099,39.30815,3.14
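
The descriptive checks mentioned earlier can be sketched as follows, using the five `CrimeRate` values from the preview plus one made-up extreme value so that the outlier rule has something to flag. The rule of 1.5 times the interquartile range is a common convention chosen here for illustration; the source does not state which outlier criterion was used.

```python
import pandas as pd

rates = pd.DataFrame(
    {
        "Neighborhood": ["Abell", "Allendale", "Arcadia", "Arlington",
                         "Armistead Gardens", "Hypothetical"],
        # First five values come from the preview; the last one is made up
        "CrimeRate": [2.59, 4.68, 1.06, 4.32, 3.14, 40.0],
    }
)

# Count, mean, standard deviation, min, quartiles and max
summary = rates["CrimeRate"].describe()

# Number of missing crime-rate values
n_missing = int(rates["CrimeRate"].isnull().sum())

# Flag values more than 1.5 * IQR outside the quartiles
q1 = rates["CrimeRate"].quantile(0.25)
q3 = rates["CrimeRate"].quantile(0.75)
iqr = q3 - q1
is_outlier = (rates["CrimeRate"] < q1 - 1.5 * iqr) | (rates["CrimeRate"] > q3 + 1.5 * iqr)
outliers = rates.loc[is_outlier, "Neighborhood"].tolist()

print(summary)
print("missing:", n_missing)
print("outliers:", outliers)
```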
