# IBM Data Science Capstone: Finding the Most Compatible Neighborhood

## Part 1: Introduction/Business Problem

The target audience for this project is newcomers to a city or those thinking of moving to a city.

The purpose of the project is to help the target audience identify promising neighborhoods where they should focus their housing search.

The user will provide their preferred neighborhood attributes as input, and the model will output a list of neighborhoods most compatible with those preferences.

As a first iteration we limit the city to Portland, Oregon, but ideally we'll expand the model to include any U.S. city or zip code of the user's choosing.

## Part 2: Data Required

The required data is a list of Portland neighborhoods with their zip codes and corresponding latitudes and longitudes. The list below is scraped from www.portlandneighborhood.com.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
df=pd.read_html('https://portlandneighborhood.com/portlandzipcodes')
dfn=df[0]
dfn.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | View the Alameda Neighborhood profile | 97212 | |
| 1 | View the Arbor Lodge Neighborhood profile | 97217 | |
| 2 | Ardenwald | 97222 | |
| 3 | View the Argay Neighborhood profile | 97230 | |
| 4 | View the Arlington Heights Neighborhood profile | 97201 | |
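
The raw table above wraps most names in link text of the form "View the ... Neighborhood profile". A minimal cleaning sketch follows, assuming every wrapped name uses exactly that pattern; the helper name `clean_name` is hypothetical and not part of the original notebook.

```python
import re

# Link-text wrapper seen in the scraped table,
# e.g. "View the Alameda Neighborhood profile" -> "Alameda".
_WRAPPER = re.compile(r"^View the (.+?) Neighborhood profile$")


def clean_name(raw):
    """Return the bare neighborhood name from a scraped cell (hypothetical helper)."""
    text = str(raw).strip()
    match = _WRAPPER.match(text)
    # Cells without the wrapper (such as "Ardenwald") are returned unchanged.
    return match.group(1) if match else text


print(clean_name("View the Arbor Lodge Neighborhood profile"))
print(clean_name("Ardenwald"))
```

On the dataframe itself, this could be applied with something like `dfn[0].map(clean_name)`, assuming column `0` holds the names as in the output above.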


Once the list of neighborhoods is cleaned up, a geocoder will be used to look up the latitude and longitude of each neighborhood, and those coordinates will be appended to the cleaned dataframe. An example of retrieving the location data is shown below.

In [10]:
!pip install geopy
from geopy.geocoders import Nominatim



In [11]:
# Zip codes are identifiers, so pass the query as a string rather than an int
address_example = '97212'
geolocator = Nominatim(user_agent="portland_explorer")
location = geolocator.geocode(address_example)  # returns None if nothing matches
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of this Portland neighborhood are {}, {}.'.format(latitude, longitude))

The geographical coordinates of this Portland neighborhood are 45.544455550640166, -122.64324085432656.
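
The single lookup above could be extended to the whole list roughly as follows. This is a sketch under stated assumptions: rows are plain dicts with a `"zip"` key (a hypothetical layout), the helper name `add_coordinates` is invented for illustration, and the geocoder is passed in as a callable so that the same function works with `geolocator.geocode` or with a stub. Lookups are cached per zip code because several neighborhoods share one.

```python
import time


def add_coordinates(rows, geocode, pause_seconds=1.0):
    """Attach latitude/longitude to each row using the given geocode callable.

    rows: list of dicts with a "zip" key (hypothetical layout).
    geocode: callable taking a query string and returning an object with
        .latitude and .longitude attributes, or None when nothing is found.
    """
    cache = {}
    enriched = []
    for row in rows:
        query = str(row["zip"])
        if query not in cache:
            # Pause between real lookups to stay polite to the service.
            if cache and pause_seconds > 0:
                time.sleep(pause_seconds)
            cache[query] = geocode(query)
        location = cache[query]
        enriched.append({
            **row,
            "latitude": location.latitude if location else None,
            "longitude": location.longitude if location else None,
        })
    return enriched
```

With geopy, `geocode` could be `geolocator.geocode` directly, or the `RateLimiter` wrapper from `geopy.extra.rate_limiter` with `pause_seconds=0`. Rows that fail to geocode keep `None` coordinates so they can be inspected rather than silently dropped.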


The location data will then be used to query Foursquare for the types of venues available in each neighborhood, with a user-defined search radius for each query.
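
As a rough illustration of the venue query, the sketch below builds a request URL and flattens a response. It assumes Foursquare's legacy v2 `venues/explore` endpoint and its `response -> groups -> items -> venue` JSON layout; the notebook does not say which API version is used, so treat the endpoint, the parameter names, and the helper names `build_explore_url` and `extract_venues` as assumptions. Credentials are placeholders supplied by the caller.

```python
from urllib.parse import urlencode

# Assumed legacy v2 endpoint; newer Foursquare API versions differ.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius_m, client_id, client_secret,
                      version="20180605", limit=100):
    """Build an explore request URL around (lat, lng) with a radius in meters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                      # API version date, YYYYMMDD
        "ll": "{},{}".format(lat, lng),
        "radius": radius_m,                # the user-defined search radius
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


def extract_venues(payload):
    """Flatten an explore-style JSON payload into (name, category) pairs."""
    venues = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            # Venues without a category get None instead of raising.
            categories = venue.get("categories") or [{}]
            venues.append((venue.get("name"), categories[0].get("name")))
    return venues
```

The URL would be fetched once per neighborhood (for example with `requests.get(url).json()`), and the resulting pairs collected under the neighborhood's name.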

The venue information for each neighborhood will be organized so that the most common venue types are listed first. The neighborhoods will then be passed to a clustering algorithm together with a "user-defined neighborhood": a fictitious neighborhood constructed from the user's preferences.
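
One way to rank venue types and to express the user's preferences in the same form is sketched below. The input formats are assumptions: each neighborhood is a list of category strings, and the user supplies numeric weights per category. The function names are hypothetical.

```python
from collections import Counter


def category_frequencies(categories):
    """Map each venue category to its share of the neighborhood's venues."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def top_categories(categories, n=3):
    """Return the n most common categories, most frequent first."""
    return [cat for cat, _ in Counter(categories).most_common(n)]


def user_profile(preferences):
    """Normalize user weights so they are comparable with venue frequencies."""
    total = sum(preferences.values())
    return {cat: w / total for cat, w in preferences.items()} if total else {}


venues = ["Coffee Shop", "Park", "Coffee Shop", "Bar", "Park", "Coffee Shop"]
print(top_categories(venues, n=2))
print(user_profile({"Park": 3, "Coffee Shop": 1}))
```

Normalizing both the real neighborhoods and the fictitious one to shares that sum to 1 keeps a neighborhood with many venues from dominating purely because of its size.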

The cluster that includes the user-defined neighborhood will be presented, along with a map of the neighborhoods that are in the cluster.
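
The clustering step could look roughly like the sketch below. The text does not name the algorithm, so k-means is an assumption here, and this tiny pure-Python version (with hypothetical function names) stands in for a library implementation such as scikit-learn's `KMeans`. Each neighborhood, including the user-defined one, is a dict of category shares.

```python
import math


def to_vector(profile, vocabulary):
    """Turn a {category: share} dict into a fixed-order numeric vector."""
    return [profile.get(cat, 0.0) for cat in vocabulary]


def kmeans(points, k, iterations=20):
    """Tiny k-means returning one cluster label per point.

    Initial centroids are the first k points, so results are deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def neighbors_of_user(profiles, user_label, k):
    """Names of neighborhoods that share a cluster with the user-defined one."""
    vocabulary = sorted({cat for prof in profiles.values() for cat in prof})
    names = list(profiles)
    labels = kmeans([to_vector(profiles[n], vocabulary) for n in names], k)
    user_cluster = labels[names.index(user_label)]
    return [n for n, lab in zip(names, labels)
            if lab == user_cluster and n != user_label]
```

The returned names are the candidates to present to the user; their stored coordinates can then be plotted on a map.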