In [1]:
import pandas as pd #Data frames and HTML table parsing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np #Numerical helpers

from bs4 import BeautifulSoup #HTML parsing
import requests #The requests library sends the HTTP requests used later for the Foursquare API

!pip install folium #Installing Folium for creating maps
!pip install geopy #GeoPy lets us look up map coordinates by place name
!pip install wikipedia #The wikipedia package fetches article HTML, which pandas can then parse for tables
from geopy.geocoders import Nominatim #Nominatim lets us search OSM data by name and address
import folium
import wikipedia as wp


In [5]:
html = wp.page("Boston").html().encode("UTF-8") #Fetching the HTML of the Wikipedia article on Boston
try:
    #The table we want, "Demographic breakdown by ZIP Code", was the 10th table in the article when this ran.
    #The index was found by trial and error, so it may shift if the article is edited.
    df_demographics = pd.read_html(html)[9]
except IndexError:
    df_demographics = pd.read_html(html)[0] #Fallback if the article has fewer than 10 tables; this will not be the table we want
df_demographics #Checking to see if we have the correct table

Unnamed: 0,Rank,ZIP code (ZCTA),Per capitaincome,Medianhouseholdincome,Medianfamilyincome,Population,Number ofhouseholds
0,1.0,02110 (Financial District),"$152,007","$123,795","$196,518",1486,981
1,2.0,02199 (Prudential Center),"$151,060","$107,159","$146,786",1290,823
2,3.0,02210 (Fort Point),"$93,078","$111,061","$223,411",1905,1088
3,4.0,02109 (North End),"$88,921","$128,022","$162,045",4277,2190
4,5.0,02116 (Back Bay/Bay Village),"$81,458","$87,630","$134,875",21318,10938
5,6.0,02108 (Beacon Hill/Financial District),"$78,569","$95,753","$153,618",4155,2337
6,7.0,02114 (Beacon Hill/West End),"$65,865","$79,734","$169,107",11933,6752
7,8.0,02111 (Chinatown/Financial District/Leather Di...,"$56,716","$44,758","$88,333",7616,3390
8,9.0,02129 (Charlestown),"$56,267","$89,105","$98,445",17052,8083
9,10.0,02467 (Chestnut Hill),"$53,382","$113,952","$148,396",22796,6351


As we can see in the above table, there is a lot of useful data on the neighborhoods of Boston. Per capita income is a good measure of how much money people in these neighborhoods make on average; a higher per capita income means residents have more money to spend, which in turn makes it more likely that our venue will turn a good profit.

However, there is a lot of data that isn't necessary. The full table includes rows for the entirety of Massachusetts, Boston, the United States, and Suffolk County (the county that Boston is a part of). Additionally, there are columns of information we don't need: Rank is redundant, and we aren't interested in Median Household Income or Median Family Income. Number of Households may be useful, as an area may have a high per capita income without having many households, so we will keep that column.

For this next part, we will clean up the data.

In [19]:
df2 = df_demographics.drop([15, 16, 17, 20]) #Dropping the rows for Boston, Suffolk County, MA, and the United States
df3 = df2.drop(['Rank','Medianhouseholdincome','Medianfamilyincome'], axis = 1) #Dropping the aforementioned columns of data we do not need
df3[['ZIP Code','Neighborhood']] = df3['ZIP code (ZCTA)'].str.split(' ', n = 1, expand = True) #Splitting at the first space, as the column holds the ZIP code and neighborhood name together
df3 = df3.drop(['ZIP code (ZCTA)'], axis = 1) #Dropping the now redundant column; drop returns a new frame, so we reassign it
df_clean = df3[['ZIP Code', 'Neighborhood','Per capitaincome', 'Population', 'Number ofhouseholds']] #Reorganizing the columns into a better order
df_clean

Unnamed: 0,ZIP Code,Neighborhood,Per capitaincome,Population,Number ofhouseholds
0,2110,(Financial District),"$152,007",1486,981
1,2199,(Prudential Center),"$151,060",1290,823
2,2210,(Fort Point),"$93,078",1905,1088
3,2109,(North End),"$88,921",4277,2190
4,2116,(Back Bay/Bay Village),"$81,458",21318,10938
5,2108,(Beacon Hill/Financial District),"$78,569",4155,2337
6,2114,(Beacon Hill/West End),"$65,865",11933,6752
7,2111,(Chinatown/Financial District/Leather District),"$56,716",7616,3390
8,2129,(Charlestown),"$56,267",17052,8083
9,2467,(Chestnut Hill),"$53,382",22796,6351
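
The split on the first space leaves the parentheses around each neighborhood name, and the income column is still text. As a minimal sketch (an alternative, not the notebook's own method), a regular expression with `str.extract` can pull out both parts in one step while keeping the ZIP codes as strings, and the dollar amounts can be converted to integers for later arithmetic. The two sample rows are copied from the first table; the output column name `Per capita income (USD)` and the group names `zip` and `name` are illustrative.

```python
import pandas as pd

# Two rows copied from the demographics table, in the raw format Wikipedia provides.
sample = pd.DataFrame({
    "ZIP code (ZCTA)": ["02110 (Financial District)", "02116 (Back Bay/Bay Village)"],
    "Per capitaincome": ["$152,007", "$81,458"],
})

# Five digits, optional whitespace, then the neighborhood name inside parentheses.
parts = sample["ZIP code (ZCTA)"].str.extract(r"^(?P<zip>\d{5})\s*\((?P<name>.*)\)$")

tidy = pd.DataFrame({
    "ZIP Code": parts["zip"],            # stays a string, so the leading zero survives
    "Neighborhood": parts["name"],       # parentheses are not captured
    # Strip "$" and "," before converting the income to an integer.
    "Per capita income (USD)": sample["Per capitaincome"]
        .str.replace(r"[$,]", "", regex=True)
        .astype(int),
})
print(tidy)
```

Keeping the ZIP code as text matters here because every Boston ZIP code starts with a zero, which is lost as soon as the column is read as a number.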


Now that we have a cleaned data frame, we can compare the neighborhood statistics to one another. 

At first glance, the Financial District looks like the most profitable area for a new venue, since its per capita income is higher than that of any other neighborhood. However, because we kept the population and household counts, we can see that the Financial District also has one of the lowest populations and one of the lowest numbers of households.

From the data frame alone, the most appealing location appears to be the Back Bay/Bay Village neighborhood. While it ranks 5th for per capita income, its population (21,318) is far higher than that of the four neighborhoods ranked above it, and it has the most households (10,938) of the ten neighborhoods shown.
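
One rough way to combine income and population into a single number is to multiply them, giving an estimate of the total income earned in each neighborhood. This is only an illustrative sketch: the `Aggregate income` column is an assumption introduced here rather than something the analysis above computes, and it covers only the ten rows displayed in the table.

```python
import pandas as pd

# Values copied from the ten rows shown in the cleaned table.
top10 = pd.DataFrame({
    "Neighborhood": [
        "Financial District", "Prudential Center", "Fort Point", "North End",
        "Back Bay/Bay Village", "Beacon Hill/Financial District",
        "Beacon Hill/West End", "Chinatown/Financial District/Leather District",
        "Charlestown", "Chestnut Hill",
    ],
    "Per capita income": [152007, 151060, 93078, 88921, 81458,
                          78569, 65865, 56716, 56267, 53382],
    "Population": [1486, 1290, 1905, 4277, 21318,
                   4155, 11933, 7616, 17052, 22796],
})

# Hypothetical proxy for local spending power: income per person times people.
top10["Aggregate income"] = top10["Per capita income"] * top10["Population"]

ranked = top10.sort_values("Aggregate income", ascending=False).reset_index(drop=True)
print(ranked[["Neighborhood", "Aggregate income"]].head(3))
```

On this measure Back Bay/Bay Village comes out ahead of the other nine neighborhoods, which is consistent with the reading of the table above.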

We seem to have found a potential area for a venue. Let's go ahead and create a map to look at the city and Back Bay.

In [33]:
address = 'Boston, Massachusetts' #Defining the address for Nominatim 

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address) #Looking up the location
latitude = location.latitude #The latitude of Boston
longitude = location.longitude #The longitude of Boston
print('The geographical coordinates of Boston are {}, {}.'.format(latitude, longitude)) #Printing the coordinates of Boston

The geographical coordinates of Boston are 42.3602534, -71.0582912.


In [32]:
address = 'Back Bay, Boston' #Defining the address for Nominatim 

geolocator = Nominatim(user_agent="boston_explorer")
location = geolocator.geocode(address) #Looking up the location
latitude = location.latitude #The latitude of Back Bay
longitude = location.longitude #The longitude of Back Bay
print('The geographical coordinates of Back Bay/Bay Village are {}, {}.'.format(latitude, longitude)) #Printing the coordinates of the Back Bay neighborhood

The geographical coordinates of Back Bay/Bay Village are 42.35054885, -71.08031131584724.


As we are only interested in the Back Bay neighborhood, we only need the lat/long for that neighborhood. Here we can put it into a simple dataframe so we can add a marker in the Folium map of Boston.
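
One caveat with the geocoding cells: geopy's `geocode` returns `None` when nothing matches the query, in which case `location.latitude` fails with an `AttributeError`. A small guard along these lines makes that failure easier to diagnose. The helper name `to_lat_lng` is hypothetical, and the stand-in object below only mimics a geopy result so that the sketch runs without a network call.

```python
from types import SimpleNamespace


def to_lat_lng(location, query):
    """Return (latitude, longitude) from a geocoding result, or raise if nothing matched."""
    if location is None:
        raise ValueError(f"No geocoding result for {query!r}")
    return location.latitude, location.longitude


# In the notebook this would be called as:
#   to_lat_lng(geolocator.geocode(address), address)
# Here a stand-in object carries the Back Bay coordinates printed above.
fake_result = SimpleNamespace(latitude=42.35054885, longitude=-71.08031131584724)
back_bay_lat, back_bay_lng = to_lat_lng(fake_result, "Back Bay, Boston")
print(back_bay_lat, back_bay_lng)
```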

In [50]:
df_back_bay = pd.DataFrame({'Latitude':42.35054885, 'Longitude':-71.08031131584724, 'Borough': 'Back Bay', 'Neighborhood': 'Back Bay'}, index=[0])
df_back_bay

Unnamed: 0,Latitude,Longitude,Borough,Neighborhood
0,42.350549,-71.080311,Back Bay,Back Bay


In [51]:
map_boston = folium.Map(location=[latitude, longitude], zoom_start=12) #Centering the map on the most recently geocoded coordinates

for lat, lng, borough, neighborhood in zip(df_back_bay['Latitude'], df_back_bay['Longitude'], df_back_bay['Borough'], df_back_bay['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_boston)
    
map_boston

Back Bay is considered its own neighborhood and is not part of a larger borough. As such, let's go ahead and use the Foursquare API to explore the venues in the area and see if we can determine what kind of venue would do well there.

In [52]:
back_bay_data = df_back_bay[df_back_bay['Borough'] == 'Back Bay'].reset_index(drop=True)
back_bay_data

Unnamed: 0,Latitude,Longitude,Borough,Neighborhood
0,42.350549,-71.080311,Back Bay,Back Bay


In [53]:
# The code was removed by Watson Studio for sharing.
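
The hidden cell presumably defined the names `CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, and `LIMIT` that the next function relies on. A minimal sketch of such a cell might look like the following. Every value here is an assumption: the environment variable names are hypothetical, the version date is only an example in the `YYYYMMDD` form the Foursquare v2 API uses, and the original limit is not shown anywhere in the notebook.

```python
import os

# Hypothetical environment variable names; the original cell's contents are not available.
# Reading credentials from the environment keeps them out of the shared notebook.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

VERSION = "20180605"  # assumed example date in YYYYMMDD form
LIMIT = 100           # assumed cap on venues returned per request
```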

In [54]:
def getNearbyVenues(names, latitudes, longitudes, radius=500): #Fetching the venues within `radius` meters of each given location
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [55]:
back_bay_venues=getNearbyVenues(names=df_back_bay['Neighborhood'],
                                latitudes=df_back_bay['Latitude'],
                                longitudes=df_back_bay['Longitude']
                                )

Back Bay


In [58]:
back_bay_onehot = pd.get_dummies(back_bay_venues[['Venue Category']], prefix="", prefix_sep="")

back_bay_onehot['Neighborhood'] = back_bay_venues['Neighborhood'] 

fixed_columns = [back_bay_onehot.columns[-1]] + list(back_bay_onehot.columns[:-1]) # Move neighborhood column to the first column
back_bay_onehot = back_bay_onehot[fixed_columns]

back_bay_grouped = back_bay_onehot.groupby('Neighborhood').mean().reset_index() 
back_bay_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Asian Restaurant,Athletics & Sports,Bagel Shop,Bank,Boutique,Brazilian Restaurant,Burger Joint,Café,Chocolate Shop,Clothing Store,Coffee Shop,Cosmetics Shop,Cupcake Shop,Cycle Studio,Deli / Bodega,Department Store,Dessert Shop,Electronics Store,Farmers Market,Fast Food Restaurant,French Restaurant,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hotel,Ice Cream Shop,Italian Restaurant,Jewelry Store,Juice Bar,Lounge,Mediterranean Restaurant,Mexican Restaurant,Park,Pet Store,Pizza Place,Plaza,Ramen Restaurant,Restaurant,Salad Place,Salon / Barbershop,Seafood Restaurant,Shopping Mall,Southern / Soul Food Restaurant,Spa,Sporting Goods Shop,Steakhouse,Thai Restaurant,Tour Provider,Trail,Vietnamese Restaurant,Women's Store
0,Back Bay,0.02,0.04,0.01,0.01,0.01,0.01,0.02,0.01,0.01,0.01,0.02,0.07,0.06,0.03,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.02,0.01,0.02,0.01,0.04,0.04,0.04,0.01,0.03,0.01,0.02,0.01,0.01,0.01,0.01,0.02,0.01,0.02,0.02,0.01,0.04,0.03,0.01,0.02,0.02,0.02,0.01,0.01,0.01,0.01,0.01


Here we can see the frequency of each type of venue in the Back Bay neighborhood. It is important that we not pick a type of restaurant that is already heavily represented, as the competition would make it harder for our prospective restaurant to turn a profit. Thus, let's see what the 10 most common types of venue in the area are.

In [61]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: #Ranks past 3rd have no entry in indicators and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = back_bay_grouped['Neighborhood']

for ind in np.arange(back_bay_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(back_bay_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Back Bay,Clothing Store,Coffee Shop,American Restaurant,Seafood Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Cosmetics Shop,Shopping Mall,Juice Bar


Now we have an idea of what types of venue not to pursue. As we are specifically interested in opening a restaurant rather than any type of venue, we can ignore clothing stores, hotels, cosmetics shops, and shopping malls. As the latter two tend to be very expensive to construct and establish, we have all the more reason to discount them.

Next, let's look at food venues. Coffee shops are the most common of these, which implies that competition would be fierce. American and seafood restaurants are also rather common, which may dissuade us from pursuing these.

Ice cream shops, while facing less competition and having lower startup costs and overhead, still may not be the best choice. As Boston is located in the Northeast of the United States, ice cream is not exactly popular come fall, winter, and early spring. As such, an ice cream shop is a seasonal venue at best. Juice bars, as they tend to serve smoothies, have the same issue, albeit to a lesser degree.

This leaves what would probably be the best option: an Italian restaurant. It is not a common enough venue to pose a threat of competition, while being popular enough to ensure there is consumer interest in the area. We also do not need to worry about losing profits in the winter due to the seasonal nature of the venue. Thus, it appears our target area would be Boston's Back Bay neighborhood, and our target restaurant would be an Italian restaurant. 
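
To make the comparison between food venues explicit, the shortlist can be narrowed to food-related categories and sorted by frequency. The sketch below uses the frequencies of the ten most common categories from the grouped table above. The `food_categories` list is a hand-picked assumption about which categories count as food venues, not something Foursquare provides.

```python
import pandas as pd

# Frequencies of the ten most common venue categories, copied from the grouped table.
freq = pd.Series({
    "Clothing Store": 0.07,
    "Coffee Shop": 0.06,
    "American Restaurant": 0.04,
    "Seafood Restaurant": 0.04,
    "Hotel": 0.04,
    "Ice Cream Shop": 0.04,
    "Italian Restaurant": 0.04,
    "Cosmetics Shop": 0.03,
    "Shopping Mall": 0.03,
    "Juice Bar": 0.03,
})

# Hand-picked assumption: which of these categories serve food or drink.
food_categories = [
    "Coffee Shop", "American Restaurant", "Seafood Restaurant",
    "Ice Cream Shop", "Italian Restaurant", "Juice Bar",
]

# Keep only the food venues and list the most common first.
food_share = freq[food_categories].sort_values(ascending=False)
print(food_share)
```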