# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera
###### Author: Tim Andrews

## Table of contents
* [Introduction: Where to Live in DC](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Where to Live in DC <a name="introduction"></a>

Washington, DC is the capital of the United States and also one of the fastest-growing cities in the country.  More than 700,000 people live within the District's borders, and the entire metropolitan area has a population of over 6.2 million.

Drawn by political jobs and growing financial opportunities, many people are moving into the District.  When people choose a place to live, two major factors are the number of food options nearby (restaurants, coffee shops, etc.) and the safety of the area.  Some people will prioritize a low crime rate, while others will prioritize great food options.

To suggest the best places to live within Washington, DC, we will use data science techniques to map the crime rates of different neighborhoods and to cluster these neighborhoods according to venue density.

## Data <a name="data"></a>

We will consider the following datasets and sources in order to find the best suggestion:
* **Zipcodes** of the different **Washington DC Neighborhoods** https://opendata.dc.gov/datasets/zip-codes/data?geometry=-77.193%2C38.864%2C-76.866%2C38.911&orderBy=TYPE
* .csv file with each **Zip Code's Latitude and Longitude** https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/?refine.state=DC
* .csv file with every **Reported Crime in Washington DC from the past two years**, as well as each **Crime's Latitude and Longitude**  https://dcatlas.dcgis.dc.gov/crimecards/
* **Foursquare API** to get the number and type of food options in each neighborhood

##### Import Libraries

I import all of the libraries I expect to need here; more may be imported later as the need arises.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

from bs4 import BeautifulSoup as Soup
from pandas import DataFrame
import seaborn as sns
import matplotlib.pyplot as plt
from os import path

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


### A) DC Neighborhoods and ZipCodes

I found zipcode data for DC on the local government's website, so I downloaded it and imported it into my notebook.

https://opendata.dc.gov/datasets/zip-codes/data?geometry=-77.193%2C38.864%2C-76.866%2C38.911&orderBy=TYPE

In [2]:
dc_zip = pd.read_csv("C:/Users/Tim/Desktop/CapstoneProject/Zip_Codes.csv")
print(dc_zip['TYPE'].value_counts())
dc_zip.head()


UNIQUE             138
STANDARD            25
POST OFFICE BOX      7
Name: TYPE, dtype: int64


Unnamed: 0,OBJECTID,ZIPCODE,GIS_ID,WEB_URL,EPA_URL,NAME,TYPE,UNINSURED_POPULATION,MEDICAID_RECIPIENT,LABEL,ZIP_CODE_TEXT,Shape_Length,Shape_Area,POP_2000,POP_2010,NEIGHBORHOOD
0,1,20036,ZIP_036,http://www.usps.gov,http://maps.epa.gov/scripts/.esrimap?name=envi...,,STANDARD,27,54,20036,20036,5901.082276,849613.4,3808.0,4764.0,"Dupont Circle, Logan Circle, Shaw"
1,2,20037,ZIP_037,http://www.usps.gov,http://maps.epa.gov/scripts/.esrimap?name=envi...,,STANDARD,91,290,20037,20037,16360.80249,1936378.0,12642.0,14544.0,Foggy Bottom
2,3,20039,ZIP_039,http://www.usps.gov,http://maps.epa.gov/scripts/.esrimap?name=envi...,LAMOND-RIGGS POST OFFICE,POST OFFICE BOX,0,1,20039,20039,36.292949,52.20444,,,
3,4,20040,ZIP_040,http://www.usps.gov,http://maps.epa.gov/scripts/.esrimap?name=envi...,BRIGHTWOOD POST OFFICE,POST OFFICE BOX,0,5,20040,20040,28.57796,34.67719,,,
4,5,20043,ZIP_043,http://www.usps.gov,http://maps.epa.gov/scripts/.esrimap?name=envi...,MARTIN LUTHER KING JR POST OFFICE,POST OFFICE BOX,0,1,20043,20043,35.039224,76.69482,,,


We want to limit our zipcodes to only where TYPE = 'STANDARD'.  These zipcodes are the only ones in the dataset that have population values.

I'm also going to trim off unneeded columns. I'm not sure yet whether I need the population numbers, but I'll hold onto them.

The result should have 25 rows, which I check with `.shape`.

In [3]:
dc_zip = dc_zip.loc[dc_zip['TYPE'] == 'STANDARD', ['ZIPCODE', 'TYPE', 'POP_2000', 'POP_2010', 'NEIGHBORHOOD']]
print(dc_zip.shape)
dc_zip.head()

(25, 5)


Unnamed: 0,ZIPCODE,TYPE,POP_2000,POP_2010,NEIGHBORHOOD
0,20036,STANDARD,3808.0,4764.0,"Dupont Circle, Logan Circle, Shaw"
1,20037,STANDARD,12642.0,14544.0,Foggy Bottom
70,20001,STANDARD,33550.0,39296.0,"Penn Quarter, Mount Vernon Square, Howard U"
71,20002,STANDARD,49333.0,51252.0,"Capitol Hill, H Street, Eckington, Trinidad, K..."
72,20003,STANDARD,23122.0,26751.0,Navy Yard


In [4]:
dc_zip.tail()

Unnamed: 0,ZIPCODE,TYPE,POP_2000,POP_2010,NEIGHBORHOOD
89,20024,STANDARD,11795.0,11455.0,Southwest-Waterfront
92,20032,STANDARD,31688.0,33147.0,Congress Heights
103,20597,STANDARD,,,
150,20536,STANDARD,,,
156,20547,STANDARD,,,


It looks like there are 3 zipcodes with no population data at all.  Let's remove them from the table, leaving 22 neighborhoods.

In [5]:
dc_zip = dc_zip[dc_zip.POP_2010 > 0]
print(dc_zip.shape)

(22, 5)
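
The filter above also drops the rows whose population is missing, because any comparison with `NaN` evaluates to `False`. A minimal sketch on a toy frame illustrates this; the three rows are copied from the `tail()` output shown earlier, and `toy`, `kept`, and `kept_explicit` are illustrative names that are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: two populated ZIP codes and one with missing population
toy = pd.DataFrame({
    "ZIPCODE": [20024, 20032, 20597],
    "POP_2010": [11455.0, 33147.0, np.nan],
})

# NaN > 0 is False, so the row with missing population is filtered out
kept = toy[toy["POP_2010"] > 0]

# An alternative that states the intent (drop missing values) more directly
kept_explicit = toy.dropna(subset=["POP_2010"])

print(kept["ZIPCODE"].tolist())
```

The two forms agree here, but they would differ if a ZIP code had a recorded population of exactly 0: the comparison would drop it, while `dropna` would keep it.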


### B) DC ZipCodes with Latitude and Longitude

I found the latitude and longitude of each DC zipcode at the following link, then downloaded the data and imported it into the notebook.

https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/?refine.state=DC

In [6]:
dc_latlong = pd.read_csv("C:/Users/Tim/Desktop/CapstoneProject/dc-zip-code-latitude-and-longitude.csv")
print(dc_latlong.shape)
dc_latlong.head()

(276, 8)


Unnamed: 0,Zipcode,City,State,Latitude,Longitude,Timezone,Daylight savings time flag,geopoint
0,20227,Washington,DC,38.893311,-77.014647,-5,1,"38.893311,-77.014647"
1,20521,Washington,DC,38.893311,-77.014647,-5,1,"38.893311,-77.014647"
2,20557,Washington,DC,38.887405,-77.004663,-5,1,"38.887405,-77.004663"
3,20277,Washington,DC,38.893311,-77.014647,-5,1,"38.893311,-77.014647"
4,20026,Washington,DC,38.893311,-77.014647,-5,1,"38.893311,-77.014647"


This looks pretty clean.  Now I will merge the Latitude and Longitudes from this table onto the neighborhood and zipcodes table.

In [7]:
dc_neighb = pd.merge(dc_zip, dc_latlong, left_on = 'ZIPCODE', right_on = 'Zipcode')
dc_neighb = dc_neighb[['ZIPCODE', 'TYPE', 'POP_2000', 'POP_2010', 'NEIGHBORHOOD', 'Latitude', 'Longitude']]
dc_neighb['ZIPCODE'] = dc_neighb['ZIPCODE'].astype(str)
dc_neighb.head()

Unnamed: 0,ZIPCODE,TYPE,POP_2000,POP_2010,NEIGHBORHOOD,Latitude,Longitude
0,20036,STANDARD,3808.0,4764.0,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148
1,20037,STANDARD,12642.0,14544.0,Foggy Bottom,38.900394,-77.05126
2,20001,STANDARD,33550.0,39296.0,"Penn Quarter, Mount Vernon Square, Howard U",38.907711,-77.01732
3,20002,STANDARD,49333.0,51252.0,"Capitol Hill, H Street, Eckington, Trinidad, K...",38.901811,-76.99097
4,20003,STANDARD,23122.0,26751.0,Navy Yard,38.881762,-76.99447
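
By default `pd.merge` performs an inner join, so a zipcode with no match in the latitude/longitude table would silently disappear. The sketch below shows one way to check for that, assuming toy stand-ins for `dc_zip` and `dc_latlong`; the values are copied from the tables above, except ZIP `99999`, which is a made-up code added to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for dc_zip and dc_latlong (99999 is a hypothetical ZIP code)
zips = pd.DataFrame({"ZIPCODE": [20036, 20037, 99999]})
latlong = pd.DataFrame({
    "Zipcode": [20036, 20037],
    "Latitude": [38.906778, 38.900394],
    "Longitude": [-77.04148, -77.05126],
})

# A left join keeps every row of `zips`; indicator=True adds a `_merge` column,
# and validate="one_to_one" raises if either key column contains duplicates
checked = pd.merge(zips, latlong, left_on="ZIPCODE", right_on="Zipcode",
                   how="left", indicator=True, validate="one_to_one")

# ZIP codes that found no coordinates in the lookup table
unmatched = checked.loc[checked["_merge"] == "left_only", "ZIPCODE"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real tables would confirm that all 22 neighborhoods received coordinates.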


Now, the Washington DC Neighborhood markers can easily be mapped!

First, I googled Washington DC's Latitude and Longitude coordinates and set them to a variable.

In [8]:
latitude = 38.9072
longitude = -77.0369
print('The geographical coordinates of Washington DC are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Washington DC are 38.9072, -77.0369.


Next, I used folium to plot the markers of each neighborhood.

In [9]:
map_dc = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, zipcode, neighborhood in zip(dc_neighb['Latitude'], dc_neighb['Longitude'], dc_neighb['ZIPCODE'], dc_neighb['NEIGHBORHOOD']):
    label = '{}, {}'.format(neighborhood, zipcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dc)  
    
map_dc

### C) Foursquare API

Now that the neighborhoods have been defined and marked, we need to access the Foursquare API to determine which ones have a high density of food options.

We will define a function to pull in the closest restaurants to each neighborhood. It relies on `CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, and `LIMIT` already being defined in the notebook.

In [11]:
category = '4d4b7105d754a06374d81259' #food category

def getNearbyFood(names, latitudes, longitudes, radius = 500):

    venues_list = []
    for name, lat, lng, in zip(names, latitudes, longitudes):
        print(name)
        
        # Build the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            category,
            radius,
            LIMIT)
        
        #GET request to the Foursquare API
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        #sets what information to return for the venues
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_food = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_food.columns = ['Neighborhood',
                          'Neighborhood Latitude',
                          'Neighborhood Longitude',
                          'Venue',
                          'Venue Latitude',
                          'Venue Longitude',
                          'Venue Category']
    
    return(nearby_food)
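
The function above indexes straight into `["response"]["groups"][0]["items"]` and `["categories"][0]`, so a response without groups or a venue without categories would raise an error. The sketch below shows a more defensive way to flatten one response. The payload shape is inferred only from the indexing in the function above, and `extract_venues`, the mock payload, and the uncategorized venue are hypothetical names and data used for illustration.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Mock payload shaped like the structure getNearbyFood indexes into
mock = {"response": {"groups": [{"items": [
    {"venue": {"name": "CAVA",
               "location": {"lat": 38.906639, "lng": -77.042132},
               "categories": [{"name": "Mediterranean Restaurant"}]}},
    {"venue": {"name": "Uncategorized Venue",  # hypothetical venue
               "location": {"lat": 38.9, "lng": -77.0},
               "categories": []}},
]}]}}

rows = extract_venues(mock, "Dupont Circle, Logan Circle, Shaw",
                      38.906778, -77.04148)
empty = extract_venues({"response": {}}, "Example", 0.0, 0.0)
print(len(rows), empty)
```

Returning an empty list for a neighborhood with no results lets the loop continue instead of stopping partway through the list of neighborhoods.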
        

Now let's apply this function to our 22 DC Neighborhoods

In [13]:
LIMIT = 100

dc_food = getNearbyFood(names = dc_neighb['NEIGHBORHOOD'],
                       latitudes = dc_neighb['Latitude'],
                       longitudes = dc_neighb['Longitude']
                       )

Dupont Circle, Logan Circle, Shaw
Foggy Bottom
Penn Quarter, Mount Vernon Square, Howard U
Capitol Hill, H Street, Eckington, Trinidad, Kingman Park
Navy Yard
Federal Triangle
Downtown
Foggy Bottom - GWU - West End
Georgetown, Glover Park
Woodly Park, Cleveland Park
Adams Morgan
Columbia Heights
Petworth
Brightwood
Chevy Chase
Tenley Town, Spring Valley
Brookland, Michigan Park
Brentwood
Deanwood, Benning Heights, Fort Dupont
Anacostia
Southwest-Waterfront
Congress Heights


Let's check the shape of our new food dataset and look at the first 5 rows.

In [14]:
print(dc_food.shape)
dc_food.head()

(672, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148,CAVA,38.906639,-77.042132,Mediterranean Restaurant
1,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148,Bub and Pop's,38.905712,-77.042335,Sandwich Place
2,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148,Boqueria,38.905921,-77.04314,Spanish Restaurant
3,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148,Iron Gate,38.906953,-77.040019,Mediterranean Restaurant
4,"Dupont Circle, Logan Circle, Shaw",38.906778,-77.04148,Nando's,38.906136,-77.041951,Portuguese Restaurant


Now let's group by each Neighborhood to see how many venues were returned for each

In [15]:
dc_food.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adams Morgan,46,46,46,46,46,46
Anacostia,2,2,2,2,2,2
Brentwood,8,8,8,8,8,8
Brightwood,9,9,9,9,9,9
"Brookland, Michigan Park",12,12,12,12,12,12
"Capitol Hill, H Street, Eckington, Trinidad, Kingman Park",42,42,42,42,42,42
Chevy Chase,3,3,3,3,3,3
Columbia Heights,31,31,31,31,31,31
"Deanwood, Benning Heights, Fort Dupont",6,6,6,6,6,6
Downtown,93,93,93,93,93,93


Finally, let's see how many unique food categories there are in total

In [16]:
print('There are {} unique categories.'.format(len(dc_food['Venue Category'].unique())))

There are 88 unique categories.


### D) Crime Data

I also found Washington DC crime data from the DC government at the following link: https://dcatlas.dcgis.dc.gov/crimecards/

It has the past two years of reported crimes in Washington DC along with their coordinates.

In [17]:
dc_crime = pd.read_csv("C:/Users/Tim/Desktop/CapstoneProject/DC_Crimes_ZC.csv")
print(dc_crime.shape)
dc_crime = dc_crime[['offensegro','OFFENSE', 'LATITUDE', 'LONGITUDE', 'ZIPCODE']]
dc_crime.head()

(68573, 88)



Unnamed: 0,offensegro,OFFENSE,LATITUDE,LONGITUDE,ZIPCODE
0,property,theft/other,38.961971,-77.027959,20011
1,property,theft/other,38.902519,-77.015681,20001
2,property,theft/other,38.922212,-76.993152,20018
3,property,theft/other,38.901927,-77.039453,20006
4,property,theft f/auto,38.903732,-77.054268,20037


Now we want to group all of the reported crimes by their zipcode, returning only the zipcode and the total count

In [18]:
dc_crimetotal = dc_crime.groupby('ZIPCODE').count()

dc_crimetotal.reset_index(inplace = True)
dc_crimetotal = dc_crimetotal[['ZIPCODE', 'offensegro']]
dc_crimetotal.columns = ['ZipCode', 'Count']
dc_crimetotal

Unnamed: 0,ZipCode,Count
0,0,43
1,20001,7729
2,20002,8600
3,20003,3493
4,20004,1337
5,20005,2284
6,20006,620
7,20007,2875
8,20008,1475
9,20009,5198


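Now let's merge in the Neighborhood names as well. Because the merge is an inner join, it also removes ZipCodes that are not in our neighborhood table, such as the placeholder `0`.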

In [19]:
dc_crimetotal = pd.merge(dc_crimetotal, dc_zip, left_on = 'ZipCode', right_on = 'ZIPCODE')
dc_crimetotal = dc_crimetotal[['ZipCode', 'NEIGHBORHOOD', 'Count']]
dc_crimetotal.columns = ['ZipCode', 'Neighborhood', 'Count']
dc_crimetotal['ZipCode'] = dc_crimetotal['ZipCode'].astype(str)
dc_crimetotal.head()

Unnamed: 0,ZipCode,Neighborhood,Count
0,20001,"Penn Quarter, Mount Vernon Square, Howard U",7729
1,20002,"Capitol Hill, H Street, Eckington, Trinidad, K...",8600
2,20003,Navy Yard,3493
3,20004,Federal Triangle,1337
4,20005,Downtown,2284


Now we can map the total crimes of the past two years onto DC.  Luckily, I was able to find a GeoJSON file of the zipcode boundaries at https://opendata.dc.gov/

In [20]:
#latitude and longitude for DC was defined on the previous map, so we will use those variables here

# Load the GeoJSON file with the neighborhood borders
with open('C:/Users/Tim/Desktop/CapstoneProject/zipcodes.json') as json_file:
    neighb_map = json.load(json_file)

dc_map1 = folium.Map(location = [latitude, longitude], zoom_start = 12)


dc_map1.choropleth(
    geo_data = neighb_map,
    data = dc_crimetotal,
    columns = ['ZipCode', 'Count'],
    key_on = 'feature.properties.ZIP_CODE_T',
    fill_color = 'YlOrRd',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = 'Reported Crimes in Washington DC (past two years)',
    reset = True)

dc_map1
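
A choropleth only colors a region when the value in the data column matches the GeoJSON property named in `key_on`, which is presumably why `ZipCode` was cast to a string earlier. The sketch below shows a quick way to compare the two sets of keys before mapping. It uses a hypothetical two-feature GeoJSON; only the property name `ZIP_CODE_T` comes from the cell above, and treating its values as strings is an assumption.

```python
# Hypothetical GeoJSON; only the property name ZIP_CODE_T is taken
# from the key_on argument used in the choropleth call
geojson = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"ZIP_CODE_T": "20001"}, "geometry": None},
    {"type": "Feature", "properties": {"ZIP_CODE_T": "20002"}, "geometry": None},
]}

# Stand-in for dc_crimetotal['ZipCode'] after the cast to str
data_keys = ["20001", "20002", "20003"]

# Keys present on one side but not the other
geo_keys = {f["properties"]["ZIP_CODE_T"] for f in geojson["features"]}
missing_in_geojson = sorted(set(data_keys) - geo_keys)
missing_in_data = sorted(geo_keys - set(data_keys))
print(missing_in_geojson, missing_in_data)
```

Any key listed in `missing_in_geojson` would have no shape to color, and any key in `missing_in_data` would be drawn without a value.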

## Methodology <a name="methodology"></a>

## Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>