<h1 align=center><font size = 5>New Construction and Potential for New Businesses in Neighborhoods in Palm Beach County </font></h1>

## Introduction

In this report, we use the 2019 Building Permit Reports from Palm Beach County Planning, Zoning and Building (available at http://discover.pbcgov.org/pzb/planning/Pages/Permit-Activity-Reports.aspx) to find where the county is permitting new housing construction. We then use the Foursquare API to explore neighborhoods in Palm Beach County and analyze which venues may be needed to serve these planned housing developments. We use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, we use the Folium library to visualize the neighborhoods in Palm Beach County and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Palm Beach County</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Import libraries 

In [None]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import json  #library to handle json files
import random # library for random number generation

#scraping pdf
!pip install tabula-py
!pip install tabulate
import tabula 
import tabulate

# Excel files
import xlrd


!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt


# import k-means for clustering
from sklearn.cluster import KMeans

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas dataframe
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


In [None]:
import os
print (os.getcwd())

We created a new Excel spreadsheet that pairs the cities in the MUNICIPALITIES column with their coordinates, looked up on www.lat-long.com, https://www.findlatitudeandlongitude.com/, or latlong.net.

In [None]:
lat_lon = pd.read_excel(r'/resources/labs/DP0701EN/Palm Beach County Cities and Zips.xlsx')
lat_lon.head()
                        

For the PDF report, we used Tabula (https://tabula.technology/) to extract the table to a CSV file.

In [None]:
PAR = pd.read_csv(r'/resources/labs/DP0701EN/tabula-4thQuarterPermitActivityReport.csv', sep=',', header=None,  names = ["MUNICIPALITIES", "SINGLE FAMILY UNITS", "SFU VALUE", "MULTI FAMILY UNITS", "MFU VALUE", "TOTAL UNITS", "TOTAL UNITS VALUE"])
PAR.head()
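
Tables extracted from a PDF often arrive as text, with thousands separators or dollar signs still attached. If that is the case here (an assumption, since the CSV contents are not shown), sorting by `TOTAL UNITS` would compare strings rather than numbers. A minimal sketch of a cleaning step follows; `to_number` and the demo rows are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert to numbers (NaN on failure)."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical rows standing in for the extracted permit table.
demo = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town B", "Town C"],
    "TOTAL UNITS": ["1,250", "97", "310"],
})

demo["TOTAL UNITS"] = to_number(demo["TOTAL UNITS"])

# Sorting now compares numbers, not strings.
top = demo.sort_values(by="TOTAL UNITS", ascending=False)
print(top)
```

The same helper could be applied to each numeric column of `PAR` before sorting, if `PAR.dtypes` shows them as `object`.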
                        

In [None]:
# sort by total units and keep the top 5 municipalities by new construction
Top5 = PAR.sort_values(by='TOTAL UNITS', ascending=False).head()
Top5


Merge the permit data with the coordinates

In [None]:
#merge dataframes
PBC_df= pd.merge( Top5,lat_lon, on='MUNICIPALITIES')
PBC_df
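
An inner merge silently drops any municipality whose name is spelled differently in the two files (for example, with a trailing space). The sketch below shows one way to detect that, using the `indicator` option of `DataFrame.merge`. The helper `merge_with_check`, the town names, and the coordinate values are hypothetical placeholders, not data from the report.

```python
import pandas as pd

def merge_with_check(left, right, key):
    """Left-merge on `key` after trimming whitespace; also return unmatched keys."""
    left = left.assign(**{key: left[key].str.strip()})
    right = right.assign(**{key: right[key].str.strip()})
    merged = left.merge(right, on=key, how="left", indicator=True)
    unmatched = merged.loc[merged["_merge"] == "left_only", key].tolist()
    return merged.drop(columns="_merge"), unmatched

# Placeholder data: "Town B " has a trailing space, "Town D" has no coordinates.
permits = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town B ", "Town D"],
    "TOTAL UNITS": [300, 200, 100],
})
coords = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town B"],
    "Latitude": [10.0, 20.0],
    "Longitude": [-10.0, -20.0],
})

merged, unmatched = merge_with_check(permits, coords, "MUNICIPALITIES")
print(merged)
print("No coordinates found for:", unmatched)
```

If the unmatched list is empty for the real data, the plain `pd.merge` above loses nothing.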

EXPLORE AND CLUSTER 

In [None]:
## MAP OF TOP 5 Municipalities of new construction

In [None]:
address = 'Palm Beach County'

geolocator = Nominatim(user_agent="PBC_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Palm Beach County are {}, {}.'.format(latitude, longitude))

In [None]:
map_PBC = folium.Map(location=[latitude,longitude],zoom_start=10)

for lat, lng, municipality in zip(PBC_df['Latitude'], PBC_df['Longitude'], PBC_df['MUNICIPALITIES']):
    label = '{}'.format(municipality)
    label = folium.Popup(label, parse_html=True)
    # parse_html belongs to folium.Popup, so it is not passed to CircleMarker
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_PBC)
map_PBC

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

## Define Foursquare Credentials and Version

In [None]:
CLIENT_ID = 'LAUAGY5VQH2DJ4VUXN4OXSNGEGCKT0TLXSDASO4FL1XB4SES' # your Foursquare ID
CLIENT_SECRET = 'D3CTWFSGB2D5XW0DQ02ZB2VGVOI2IZMI0ISACKDSVCL0MEV2' # your Foursquare Secret
VERSION = '20180605'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
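
Credentials written into a notebook are visible to anyone who receives the file. One common alternative, sketched here as an assumption rather than as the setup used in this report, is to read them from environment variables and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Set the environment variable {} first".format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demo values only; in practice the default env=os.environ would be used.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id-123456",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret-abcdef",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```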

## Let's explore the neighborhood of Palm Beach County



get venues

In [None]:
LIMIT = 100 # maximum number of venues returned by the Foursquare API
radius = 100000 # search radius in meters

In [None]:
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL
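
The URL above is assembled by hand with `str.format`. As a sketch of an alternative, the query string can be built from a dictionary with the standard library, which handles escaping and keeps each parameter next to its name. The function `build_explore_url` is hypothetical, and the coordinates and credentials in the demo call are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude
        "radius": radius,                # meters
        "limit": limit,
    }
    return "{}?{}".format(EXPLORE_ENDPOINT, urlencode(params))

# Placeholder arguments, not real credentials or coordinates.
demo_url = build_explore_url("ID", "SECRET", "20180605", 26.5, -80.1, 500, 100)
print(demo_url)
```

The comma in `ll` is percent-encoded as `%2C`, which is the standard encoding for a comma inside a query value.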


Send a GET request and examine the results

In [None]:
results = requests.get(url).json()

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on the dataframe
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Clean the data in the JSON file and put it in a pandas dataframe

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['MUNICIPALITIES', 
                  'Municipality Latitude', 
                  'Municipality Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
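
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, so a single venue without categories would stop the whole loop with an `IndexError`. The sketch below shows a more tolerant way to turn the response items into rows. `venue_rows` is a hypothetical helper, and the demo items only imitate the fields that the function above reads.

```python
def venue_rows(name, lat, lng, items):
    """Convert Foursquare 'items' into tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows

# Made-up items with the same shape as the fields used above.
demo_items = [
    {"venue": {"name": "Venue 1", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Venue 2", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]
rows = venue_rows("Town A", 0.0, 0.0, demo_items)
for row in rows:
    print(row)
```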

In [None]:
PBC_venues = getNearbyVenues(names=PBC_df['MUNICIPALITIES'],
                                   latitudes=PBC_df['Latitude'],
                                   longitudes=PBC_df['Longitude']
                                  )

Get new dataframe with venues

In [None]:
PBC_venues.head()

Group by Municipality

In [None]:
PBC_venues.groupby('MUNICIPALITIES').count().head()

In [None]:
print('There are {} unique categories.'.format(len(PBC_venues['Venue Category'].unique())))

Merge the dataframes to get the new family units being built in each municipality


In [None]:
NewHousingPBC = pd.merge(PAR, PBC_venues, on='MUNICIPALITIES')
NewHousingPBC.head()

# Analyze  Each Neighborhood

In [None]:
# one hot encoding
PBC_onehot = pd.get_dummies(PBC_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
PBC_onehot['MUNICIPALITIES'] = PBC_venues['MUNICIPALITIES'] 

# move neighborhood column to the first column
fixed_columns = [PBC_onehot.columns[-1]] + list(PBC_onehot.columns[:-1])
PBC_onehot = PBC_onehot[fixed_columns]

PBC_onehot.head()

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
PBC_grouped = PBC_onehot.groupby('MUNICIPALITIES').mean().reset_index()
PBC_grouped.head()
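
To make the meaning of these numbers concrete, here is the same one-hot and group-mean computation on a tiny made-up table. Each value in the result is the share of a municipality's venues that fall in a category, so every row sums to 1. The town names and venues are illustrative only.

```python
import pandas as pd

# Made-up venues: Town A has 4 venues, Town B has 2.
venues = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town A", "Town A", "Town A", "Town B", "Town B"],
    "Venue Category": ["Park", "Park", "Cafe", "Bar", "Cafe", "Cafe"],
})

# One column per category, with 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "MUNICIPALITIES", venues["MUNICIPALITIES"])

# The mean of a 0/1 column is the fraction of venues in that category.
grouped = onehot.groupby("MUNICIPALITIES").mean()
print(grouped)
```

For Town A, two of four venues are parks, so the `Park` value is 0.5; for Town B, both venues are cafes, so `Cafe` is 1.0.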

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in PBC_grouped['MUNICIPALITIES']:
    print("----"+hood+"----")
    temp = PBC_grouped[PBC_grouped['MUNICIPALITIES'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# function to sort venues in descending order of frequency
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['MUNICIPALITIES']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # past 'st', 'nd', 'rd' every position takes 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['MUNICIPALITIES'] = PBC_grouped['MUNICIPALITIES']

for ind in np.arange(PBC_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(PBC_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

## Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

PBC_grouped_clustering = PBC_grouped.drop(columns='MUNICIPALITIES')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(PBC_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
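
Because only the top five municipalities are clustered, `kclusters = 5` leaves no room for grouping: with five distinct rows, each municipality would typically end up in its own cluster, and if any municipality returned no venues, `KMeans` would raise an error for having fewer rows than clusters. A minimal sketch of a guard follows; `safe_n_clusters` is a hypothetical helper and the feature matrix is made up.

```python
import numpy as np

def safe_n_clusters(features, requested):
    """Cap the requested cluster count at the number of distinct rows."""
    distinct_rows = np.unique(np.asarray(features, dtype=float), axis=0).shape[0]
    return max(1, min(requested, distinct_rows))

# Made-up frequency matrix: 3 municipalities, 2 of them identical.
demo_features = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
k = safe_n_clusters(demo_features, requested=5)
print("Using", k, "clusters")

# The value could then be passed on, e.g. KMeans(n_clusters=k, random_state=0).
```

A smaller `requested` value, such as 2 or 3, would be needed for the clusters to say anything about which municipalities resemble each other.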

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

PBC_merged = PBC_df

# join the cluster labels and top venues onto PBC_df, which holds latitude/longitude for each municipality
PBC_merged = PBC_merged.join(neighborhoods_venues_sorted.set_index('MUNICIPALITIES'), on='MUNICIPALITIES')

PBC_merged.head() # check the last columns!

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(PBC_merged['Latitude'], PBC_merged['Longitude'], PBC_merged['MUNICIPALITIES'], PBC_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Let's print each neighborhood along with the 5 least common venues

In [None]:
num_least_venues = 5

for hood in PBC_grouped['MUNICIPALITIES']:
    print("----"+hood+"----")
    temp = PBC_grouped[PBC_grouped['MUNICIPALITIES'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=True).reset_index(drop=True).head(num_least_venues))
    print('\n')

In [None]:
# function to sort venues in ascending order of frequency
def return_least_common_venues(row, num_least_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=True)
    
    return row_categories_sorted.index.values[0:num_least_venues]

In [None]:
num_least_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of least common venues
columns = ['MUNICIPALITIES']
for ind in np.arange(num_least_venues):
    try:
        columns.append('{}{} Least Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Least Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['MUNICIPALITIES'] = PBC_grouped['MUNICIPALITIES']

for ind in np.arange(PBC_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_least_common_venues(PBC_grouped.iloc[ind, :], num_least_venues)

neighborhoods_venues_sorted.head()
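
Many categories have a frequency of exactly 0 in a given municipality, so the order among the "least common" entries is largely arbitrary. A sketch of a more direct question, "which categories are absent from each municipality?", is shown below on made-up frequencies shaped like `PBC_grouped`. The names and values are illustrative only.

```python
import pandas as pd

# Made-up frequency table in the same layout as PBC_grouped.
grouped = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town B"],
    "Bar": [0.25, 0.0],
    "Cafe": [0.25, 1.0],
    "Park": [0.5, 0.0],
})

freq = grouped.set_index("MUNICIPALITIES")

# For each municipality, list the categories with zero venues.
absent = {town: row.index[row == 0].tolist() for town, row in freq.iterrows()}

for town, categories in absent.items():
    print(town, "has no:", categories)
```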

Let's decide which venues are necessary and look at which municipalities with new development lack them, in order to suggest that these services move into the neighborhood.

# Let's look at the venue categories to get a better idea


In [None]:
VenueC = PBC_venues['Venue Category'].unique()
VenueC

### What we see is that there is a Restaurant category and then many different kinds of restaurants, each in its own category. There are also Bar, Pub, Wine Bar, Lounge, and Brewery categories, which seem to be the same kind of venue. This makes the Venue Category column confusing.


In [None]:
new_column= np.array(['Clothing', 'Hospitality', 'Park', 'Restaurant', 'Clothing', 'Bus', 'Business Service', 'Restaurant','Hobby Shop', 'Restaurant','Restaurant','Bar','Restaurant','Cafe', 'Bar', 'Restaurant','Bar','Restaurant','Grocery','Restaurant','Gym','Nightclub','Restaurant','Bar','Restaurant','Nightclub','Cafe','Theater','Restaurant','Restaurant','Park', 'Restaurant','Train','Restaurant','Bar','Bar','Park','Restaurant','Grocery','Road','Restaurant','Bank','Theater','Park','Pharmacy','Theater','Restaurant','Grocery','Gym','Museum','Cafe','Park','Construction','Park','Intersection','Park','Restaurant','Bar','Grocery','Restaurant','Gas Station','Business Service','Clothing','Hospitality','Grocery'])
dataset = pd.DataFrame({'Venue Category': VenueC, 'Category': new_column}, columns=['Venue Category', 'Category'])
dataset

This new dataframe narrows the search, because the same kind of service appeared under several different categories.
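
The positional array above only works while `VenueC` keeps the same length and order, which can change whenever the Foursquare results change. As a sketch of a less fragile alternative, the broad category can be looked up by name. The mapping below covers only the bar-like categories mentioned earlier plus a simple rule for restaurants; `CATEGORY_MAP`, `broad_category`, and the sample values are hypothetical.

```python
import pandas as pd

# Partial, illustrative mapping from Foursquare categories to broad ones.
CATEGORY_MAP = {
    "Bar": "Bar",
    "Pub": "Bar",
    "Wine Bar": "Bar",
    "Lounge": "Bar",
    "Brewery": "Bar",
}

def broad_category(raw):
    """Map a raw venue category to a broad one, defaulting to 'Other'."""
    if raw in CATEGORY_MAP:
        return CATEGORY_MAP[raw]
    if raw.endswith("Restaurant"):
        return "Restaurant"
    return "Other"

# Sample raw categories standing in for PBC_venues['Venue Category'].unique().
raw = pd.Series(["Pub", "Italian Restaurant", "Wine Bar", "Dog Run"], name="Venue Category")
lookup = pd.DataFrame({"Venue Category": raw, "Category": raw.map(broad_category)})
print(lookup)
```

Categories that fall through to `Other` are easy to spot and can be added to the dictionary one at a time.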

In [None]:
CategoryPBC = pd.merge(NewHousingPBC, dataset, on='Venue Category')
CategoryPBC

In [None]:
CategoryPBC.groupby('Category').count().head()

In [None]:
small_df=CategoryPBC.loc[:,['MUNICIPALITIES','Category','Venue Category','Venue','TOTAL UNITS']]
small_df.head()


# DATA FRAMES FOR EACH CITY WITH TYPES OF CATEGORY AND COUNTS OF EACH CATEGORY AND TOTAL NEW UNITS BEING BUILT IN CITY

In [None]:
small_df['MUNICIPALITIES'].unique()

## BOYNTON BEACH

In [None]:
Boynton_Beach=small_df[small_df['MUNICIPALITIES']=='Boynton Beach']
Boynton_Beach_c=Boynton_Beach['Category'].value_counts().to_frame(name='Count')
Boynton_Beach

In [None]:
Boynton_Beach_c

In [None]:
Boynton_Beach_c.plot()

## PALM BEACH GARDENS

In [None]:
Palm_Beach_Gardens=small_df[small_df['MUNICIPALITIES']=='Palm Beach Gardens']
Palm_Beach_Gardens_c=Palm_Beach_Gardens['Category'].value_counts().to_frame(name='Count')
Palm_Beach_Gardens

In [None]:
Palm_Beach_Gardens_c.plot()
Palm_Beach_Gardens_c

## WEST PALM BEACH

In [None]:
West_Palm_Beach=small_df[small_df['MUNICIPALITIES']=='West Palm Beach']
West_Palm_Beach_c=West_Palm_Beach['Category'].value_counts().to_frame(name='Count')
West_Palm_Beach

In [None]:
West_Palm_Beach_c

In [None]:
West_Palm_Beach_c.plot()


## PALM BEACH COUNTY UNINCORPORATED AREA

In [None]:
PBC_Unincorporated=small_df[small_df['MUNICIPALITIES']=='Palm Beach County Unincorporated Area']
PBC_Unincorporated_c=PBC_Unincorporated['Category'].value_counts().to_frame(name='Count')
PBC_Unincorporated

In [None]:
PBC_Unincorporated_c

In [None]:
PBC_Unincorporated_c.plot()

## WESTLAKE

In [None]:
Westlake=small_df[small_df['MUNICIPALITIES']=='Westlake']
Westlake=Westlake.drop(columns=['MUNICIPALITIES'])
Westlake_c=Westlake['Category'].value_counts().to_frame(name='Count')
Westlake_c
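
The per-city cells above repeat the same filter and `value_counts` steps. As a sketch, a single cross-tabulation gives the count of every category in every municipality at once, including the zeros that indicate missing services. The demo rows are made up and stand in for `small_df`.

```python
import pandas as pd

# Made-up rows in the same layout as small_df.
demo = pd.DataFrame({
    "MUNICIPALITIES": ["Town A", "Town A", "Town A", "Town B"],
    "Category": ["Restaurant", "Restaurant", "Park", "Bar"],
})

# Rows: municipalities; columns: categories; values: number of venues.
counts = pd.crosstab(demo["MUNICIPALITIES"], demo["Category"])
print(counts)
```

With the real data, the equivalent call would be `pd.crosstab(small_df['MUNICIPALITIES'], small_df['Category'])`.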


# CONCLUSION

This report looked at the venues surrounding the new housing developments in the five Palm Beach County areas with the most new construction in the 4th quarter of 2019. The data suggest that a lot of commercial development is needed. However, since I live in Palm Beach County, I am sure that the Foursquare data is incomplete and that there are many more venues than those listed. Therefore, I would not make recommendations using that dataset.

by Heidi Peterson