### <p style="text-align: center;">K-Means Cluster Analysis of Venues in the Bronx and Staten Island</p> 
#### <p style="text-align: center;">A Data Science Project</p> 
##### <p style="text-align: center;">by David Anim-Addo</p>

### Introduction
Research in public health shows a relationship between the socioeconomic conditions of an urban environment and the well-being of its inhabitants.  This is supported by evidence that residents with access to first-class resources have lower incidence and prevalence of disease as well as better overall health.  For example, Diez Roux (2003) examines the influence of neighborhood environments on cardiovascular health.  The author reports that residing in lower-income neighborhoods places residents at a higher risk for coronary heart disease.  Socioeconomic characteristics have also been linked to other health factors such as smoking, diet, and body mass index.  In documenting this problem, Diez Roux lists a number of neighborhood features linked to cardiovascular risk, among them access to healthy foods, recreational resources, and transportation.  The paper also discusses the importance of geographically coded information, which allows a wide variety of public health measures to be studied.  

More recently, Tabaei et al. (2018) studied the influence of socioeconomics, food options, and the quality of the built residential environment, such as parks and recreation, on glycemic control in adults with diabetes in New York City.  Their findings revealed that better diabetes control was associated with living in advantaged neighborhoods with access to high-quality environmental resources.  In addition, a trend was observed in which individuals with poor glycemic control who moved from disadvantaged neighborhoods to neighborhoods with more resources saw their health status improve.  

Both papers express the need for further research on neighborhood features and their impact on public health.  The construction of the neighborhood environment and its available resources can have long-term health effects on its residents.  The question then becomes which analytical methods are useful for gaining further insight in this field.  For example, what kinds of venues exist across New York neighborhoods with different socioeconomic levels?  The answer to this question could be a starting point for further research in public health.  

The goal of this project is to use K-Means cluster analysis to examine the venue distribution of two New York boroughs, Staten Island and the Bronx.  These areas were selected based on U.S. Census Bureau income and poverty data in order to compare regions of different socioeconomic levels.  The venues for this analysis were acquired through the Foursquare API, which groups venues into ten categories: arts and entertainment, college and education, events, food, nightlife, outdoors and recreation, professional, residence, shops, and traveling.  With these categories, neighborhoods in Staten Island and the Bronx will be segmented and clustered to observe how these venues are distributed across the different residential areas.  Afterwards, any variations in clusters between the two regions will be explored. 
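
As a rough illustration of the clustering step described above, the sketch below implements the two alternating K-Means steps (assign each point to its nearest centroid, then move each centroid to the mean of its members) in plain Python.  It is not the project's implementation: the function name `kmeans`, the rule of seeding with the first `k` points, and the sample venue shares are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means on a list of equal-length numeric tuples.

    Assumption: the first k points seed the centroids, which keeps the
    example deterministic. Real analyses usually seed randomly.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the labels are stable.
        centroids = new_centroids
    return labels, centroids


# Hypothetical neighborhoods described by the share of their venues in three
# categories (food, outdoors and recreation, shops). The values are made up.
profiles = [
    (0.70, 0.10, 0.20),
    (0.10, 0.70, 0.20),
    (0.65, 0.15, 0.20),
    (0.15, 0.65, 0.20),
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

In this toy input the food-heavy profiles and the recreation-heavy profiles end up in separate clusters, which is the kind of grouping the analysis looks for across neighborhoods.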

### Data
The socioeconomic data for this project, covering income and poverty in New York City, was collected from the U.S. Census Bureau:  [U.S. Census Bureau QuickFacts: New York](https://www.census.gov/quickfacts/fact/table/newyorkcountymanhattanboroughnewyork,bronxcountybronxboroughnewyork,queenscountyqueensboroughnewyork,kingscountybrooklynboroughnewyork,richmondcountystatenislandboroughnewyork,newyorkcitynewyork/HSG010218)

This data was used to identify New York boroughs of differing socioeconomic levels for venue cluster analysis.  Staten Island has a median household income of 76,244 dollars and the Bronx has a median household income of 36,593 dollars. 


The Beautiful Soup web scraping library will be used to acquire a list of neighborhoods in both boroughs from these Wikipedia pages: 

[List of Bronx Neighborhoods](https://en.wikipedia.org/wiki/List_of_Bronx_neighborhoods)

[List of Staten Island Neighborhoods](https://en.wikipedia.org/wiki/List_of_Staten_Island_neighborhoods)


In [None]:
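# Import the libraries used for scraping
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

# Assign the website to a variable and request it
url = 'https://en.wikipedia.org/wiki/List_of_Bronx_neighborhoods'
rq = requests.get(url)

# Use BeautifulSoup (as bs) to parse the page and locate the neighborhood table
# (assumed here to be the first table on the page)
soup = bs(rq.content, 'html.parser')
table = soup.find('table')

# Convert the HTML table into a DataFrame and preview the first rows
df = pd.read_html(StringIO(str(table)))
df2 = df[0]
df2.head(15)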


The geopy library will then be used to find the longitude and latitude of neighborhoods in each borough.  

In [None]:
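import time
import pandas as pd
from geopy.geocoders import Nominatim as nm

# Create the geocoder; Nominatim requires a user agent string
locator = nm(user_agent='NY_Geocoder')

dflat = []
dflong = []
BACKOFF_TIME = 3  # seconds to wait between requests

# dfc is assumed to be the cleaned neighborhood table built from the scraped data
for i in dfc['Neighbourhood']:
    location = locator.geocode('{}, New York City, New York'.format(i))
    if location is None:
        # Record a missing value when the neighborhood cannot be geocoded
        lat = pd.DataFrame({'Latitude': [float('nan')]})
        long = pd.DataFrame({'Longitude': [float('nan')]})
    else:
        lat = pd.DataFrame({'Latitude': [location.latitude]})
        long = pd.DataFrame({'Longitude': [location.longitude]})
    dflat.append(lat)
    dflong.append(long)
    # Pause after every request, found or not, to respect the service's rate limit
    time.sleep(BACKOFF_TIME)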
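
The geocoding loop above pauses between requests but does not handle a request that fails outright.  One possible extension, sketched below under stated assumptions, retries a failed lookup and waits longer after each attempt.  The helper `geocode_with_backoff`, its parameters, the stand-in `flaky_geocode`, and the coordinates it returns are hypothetical; in the notebook the `geocode` argument would be `locator.geocode`, and catching a broad `Exception` is a simplification.

```python
import time
from types import SimpleNamespace


def geocode_with_backoff(geocode, query, retries=3, backoff=3.0, sleep=time.sleep):
    """Call geocode(query), waiting longer after each failed attempt.

    Returns (latitude, longitude), or (nan, nan) when the place is not
    found or every attempt raises an exception.
    """
    for attempt in range(retries):
        try:
            location = geocode(query)
        except Exception:
            # Assumed transient failure (timeout, rate limit): wait, then retry.
            sleep(backoff * (attempt + 1))
            continue
        if location is None:
            break  # The service answered but found nothing; do not retry.
        return location.latitude, location.longitude
    return float('nan'), float('nan')


# Stand-in for locator.geocode that fails once and then succeeds.
calls = {'n': 0}


def flaky_geocode(query):
    calls['n'] += 1
    if calls['n'] == 1:
        raise TimeoutError('simulated timeout')
    # Placeholder coordinates, not real geocoder output.
    return SimpleNamespace(latitude=40.85, longitude=-73.87)


waits = []  # Collect the waits instead of sleeping, to keep the demo fast.
coords = geocode_with_backoff(
    flaky_geocode, 'Belmont, New York City, New York', sleep=waits.append)
missing = geocode_with_backoff(lambda q: None, 'Nowhere', sleep=waits.append)
print(coords, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays; leaving it at the default uses `time.sleep`.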

Finally, venue category data will be supplied through the Foursquare API.  

In [None]:
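import requests
import pandas as pd

# Search for venues near a point. latitude and longitude are assumed to hold
# the coordinates of one neighborhood from the geocoding step.
# The venues/search endpoint is used because the code below reads
# results['response']['venues'], which venues/categories does not return.
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude)
results = requests.get(url).json()

venues = results['response']['venues']

# Use the json_normalize function to flatten the JSON file
nearby_venues = pd.json_normalize(venues)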

# Filter the columns for the nearby venues
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# This is a function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# Clean the columns and check the table for proper titles
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
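
Before clustering, the venue table has to be summarized into one numeric profile per neighborhood.  The sketch below shows one way this could be done, by converting (neighborhood, category) pairs into the share of venues in each category.  The helper `category_shares`, the neighborhood names, and the sample rows are hypothetical and only illustrate the idea; they are not taken from the Foursquare results.

```python
from collections import Counter, defaultdict


def category_shares(rows, categories):
    """Turn (neighborhood, category) pairs into per-neighborhood share vectors.

    Each vector lists, in the order given by `categories`, the fraction of
    the neighborhood's venues that fall into that category.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1

    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = [counter[c] / total for c in categories]
    return shares


# Made-up venue rows for two placeholder neighborhoods.
rows = [
    ('Neighborhood A', 'Food'),
    ('Neighborhood A', 'Food'),
    ('Neighborhood A', 'Shops'),
    ('Neighborhood A', 'Nightlife'),
    ('Neighborhood B', 'Outdoors and Recreation'),
    ('Neighborhood B', 'Food'),
]
categories = ['Food', 'Nightlife', 'Outdoors and Recreation', 'Shops']
shares = category_shares(rows, categories)
print(shares)
```

Using shares rather than raw counts keeps neighborhoods with many venues from dominating the distance calculation, although whether that normalization suits this analysis is a design choice.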

#### References

Diez Roux, A. V. (2003). Residential Environments and Cardiovascular Risk. Journal of Urban Health, 80(4), 569–589. https://doi.org/10.1093/jurban/jtg065

Tabaei, B. P., Rundle, A. G., Wu, W. Y., Horowitz, C. R., Mayer, V., Sheehan, D. M., & Chamany, S. (2018). Associations of Residential Socioeconomic, Food, and Built Environments with Glycemic Control in Persons with Diabetes in New York City from 2007-2013. American Journal of Epidemiology, 187(4), 736–745. https://doi.org/10.1093/aje/kwx300
