# BUSINESS OBJECTIVE AND PROJECT LAYOUT

## BUSINESS OBJECTIVE

Kansas City is among the top 10 cities in the country for startups. A major part of starting any new business is deciding on a location.
In this project, a new business owner wants to open an arcade in Kansas City. I will use Foursquare's API to identify successful arcades in model cities, such as Toronto and New York, and correlate their relative success with popular nearby venues.

## ACQUIRING THE DATA

### General Data
 - Top 10 cities for startup businesses
 - Neighborhoods for city 1, city 4, city 7, and Kansas City, MO
 
### Arcade Data
 - Top 20 arcades for city 1, city 4, city 7, and Kansas City, MO
 - Geo locations and neighborhoods for each
 
### Venue Data
 - Top 50 venues for each of the 20 arcades for each city
 - Geo locations and neighborhoods for each
 - Group venues by neighborhood
 - Group venues by type and 'score'
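
The grouping steps listed above could look roughly like the sketch below. The venue rows and the column names (`neighborhood`, `venue_type`) are made up for illustration, and the 'score' shown here (each type's share of the venues in its neighborhood) is only one possible definition, not necessarily the one used later in the project.

```python
import pandas as pd

# Hypothetical venue records; in the project these would come from the venue queries.
venues = pd.DataFrame(
    {
        "neighborhood": ["Midtown", "Midtown", "Midtown", "Downtown", "Downtown"],
        "venue_type": ["Bar", "Bar", "Pizza Place", "Bar", "Theater"],
    }
)

# Count the venues of each type within each neighborhood.
counts = (
    venues.groupby(["neighborhood", "venue_type"])
    .size()
    .reset_index(name="count")
)

# Stand-in 'score': the share of a neighborhood's venues that belong to each type.
neighborhood_totals = counts.groupby("neighborhood")["count"].transform("sum")
counts["score"] = counts["count"] / neighborhood_totals

print(counts)
```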
 
 

## ANALYZING THE DATA

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

import re # regular expressions

! pip install beautifulsoup4
! pip install requests
from bs4 import BeautifulSoup as bs # library for navigating HTML files

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


### LIST OF TOP US STARTUP CITIES

Below is a link to Startup Genome's global startup ecosystem ranking for 2020:

[TOP 30 CITIES](https://startupgenome.com/article/rankings-top-40)

In [19]:
#Visually mined the 12 US startup cities.
STARTUP_DATA = {'RANKING':[1,2,3,4,5,6,7,8,9,10,11,12], 'CITY':['Silicon Valley, CA', 'New York City, NY', 'Boston, MA', 'Los Angeles, CA', 'Seattle, WA', 'Washington DC', 'Chicago, IL', 'Austin, TX', 'Atlanta, GA', 'Denver-Boulder, CO', 'Dallas, TX', 'Miami, FL']}
US_STARTUPS_CITIES = pd.DataFrame(data=STARTUP_DATA)
US_STARTUPS_CITIES = US_STARTUPS_CITIES.set_index('RANKING')
US_STARTUPS_CITIES

RANKING,CITY
1,"Silicon Valley, CA"
2,"New York City, NY"
3,"Boston, MA"
4,"Los Angeles, CA"
5,"Seattle, WA"
6,Washington DC
7,"Chicago, IL"
8,"Austin, TX"
9,"Atlanta, GA"
10,"Denver-Boulder, CO"


In [20]:
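# CHOOSE CITIES TO ANALYZE
# .iloc is positional (zero-based), so positions 1, 4, 8, and 10 select rankings 2, 5, 9, and 11.
data = US_STARTUPS_CITIES.iloc[[1, 4, 8, 10]]
df_cities = pd.DataFrame(data=data)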
df_cities

RANKING,CITY
2,"New York City, NY"
5,"Seattle, WA"
9,"Atlanta, GA"
11,"Dallas, TX"
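
Because the table is indexed by `RANKING`, it is easy to mix up row positions and ranking labels. The sketch below, which uses a shortened copy of the ranking table, shows that `.iloc` selects by zero-based position while `.loc` selects by the ranking label itself.

```python
import pandas as pd

# Shortened copy of the ranking table, for illustration only.
ranking = pd.DataFrame(
    {
        "RANKING": [1, 2, 3, 4, 5],
        "CITY": [
            "Silicon Valley, CA",
            "New York City, NY",
            "Boston, MA",
            "Los Angeles, CA",
            "Seattle, WA",
        ],
    }
).set_index("RANKING")

by_position = ranking.iloc[[1, 4]]  # rows at zero-based positions 1 and 4
by_label = ranking.loc[[2, 5]]      # rows whose RANKING label is 2 or 5

# Both selections return the same two rows.
print(by_position.equals(by_label))
```

Selecting with `.loc` keeps the code aligned with the published rankings if rows are later added or reordered.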


In [5]:
df_cities.to_csv('analysis_cities_df.csv')

## SCRAPE LOOPS

### LINKS to the Neighborhood data
[Seattle, WA](https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle)   
[Atlanta, GA](https://www.atlantaga.gov/government/departments/city-planning/office-of-zoning-development/neighborhood-planning-unit-npu/neighborhoods-by-npu)   
[Dallas, TX](https://www.dallas.com/neighborhoods)   
[Kansas City, MO](https://en.wikipedia.org/wiki/Neighborhoods_of_Kansas_City,_Missouri)

In [5]:
# download the New York neighborhood JSON file
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
json_data = open('newyork_data.json')
print('Data downloaded!')

Data downloaded!


In [6]:
seattle_doc = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle')
atlanta_doc = requests.get('https://www.atlantaga.gov/government/departments/city-planning/office-of-zoning-development/neighborhood-planning-unit-npu/neighborhoods-by-npu')
dallas_doc = requests.get('https://www.dallas.com/neighborhoods')
KC_doc= requests.get('https://en.wikipedia.org/wiki/Neighborhoods_of_Kansas_City,_Missouri')
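# Load the downloaded file with a context manager so the file handle is closed afterwards.
with open('newyork_data.json') as json_file:
    newyork_data = json.load(json_file)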
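
The notebook does not show the layout of `newyork_data.json`, so the following is only a sketch under an assumed GeoJSON-like structure: a top-level `features` list whose items carry `properties` (with `name` and `borough`) and a `geometry` with `[longitude, latitude]` coordinates. The sample record and all field names are hypothetical and should be checked against the real file.

```python
import json

# Hypothetical sample; the field names are assumptions about the file's layout.
sample = json.loads("""
{"features": [
  {"properties": {"name": "Example Heights", "borough": "Example Borough"},
   "geometry": {"coordinates": [-73.9, 40.8]}},
  {"properties": {"name": "Sample Park", "borough": "Example Borough"},
   "geometry": {"coordinates": [-73.8, 40.7]}}
]}
""")

# Flatten each feature into one row of neighborhood data.
rows = [
    {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": feature["geometry"]["coordinates"][1],
        "Longitude": feature["geometry"]["coordinates"][0],
    }
    for feature in sample["features"]
]

for row in rows:
    print(row)
```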

In [7]:
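# Name the parser explicitly so the result does not depend on which parsers are installed.
seattle_soup = bs(seattle_doc.content, 'html.parser')
atlanta_soup = bs(atlanta_doc.content, 'html.parser')
dallas_soup = bs(dallas_doc.content, 'html.parser')
KC_soup = bs(KC_doc.content, 'html.parser')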

### SCRAPING SEATTLE

In [21]:
seattle_tables = seattle_soup.select("tbody")[0]
seattle_columns = seattle_tables.find("tr").find_all("th")
seattle_column_names = [str(c.get_text()).strip() for c in seattle_columns]
seattle_column_names.pop(0)  # drop the first header cell
seattle_column_names

['Neighborhood name',
 'Within larger district',
 'Annexed[41]',
 'Locator map',
 'Street map',
 'Image',
 'Notes']

In [22]:
seattle_rows = seattle_tables.find_all("tr")
seattle_df_by_row = []
for tr in seattle_rows:
    td_list = []
    for td in tr.find_all("td"):
        cell_text = td.get_text().strip()
        # Keep only the leading run of up to three words; this drops footnote markers and long notes.
        leading_words = re.match(r'\w*\s*\w*\s*\w*', cell_text).group(0)
        td_list.append(leading_words)
    seattle_df_by_row.append(td_list)
print('it works')

it works
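
To make the row loop easier to follow, here is a small self-contained sketch of the same idea run against an inline HTML table. The table is a made-up stand-in for the Wikipedia page, and `html.parser` is chosen here only because it ships with Python.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page.
html = """
<table><tbody>
<tr><th>Neighborhood name</th><th>Within larger district</th><th>Notes</th></tr>
<tr><td>Broadview</td><td>North Seattle[1]</td><td>North of the city center, near the water</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the <th> cells of the header row.
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped here
        # Keep at most the first three words of every cell.
        rows.append([re.match(r"\w*\s*\w*\s*\w*", cell).group(0) for cell in cells])

print(header)
print(rows)
```

Skipping rows without `<td>` cells avoids the empty first row that otherwise has to be dropped from the dataframe afterwards.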


In [28]:
seattle_df = pd.DataFrame(seattle_df_by_row, columns = seattle_column_names)
seattle_df.head()

,Neighborhood name,Within larger district,Annexed[41],Locator map,Street map,Image,Notes
0,,,,,,,
1,North Seattle,Seattle,Various,,,,North of the
2,Broadview,North Seattle,1954,,,,
3,Bitter Lake,North Seattle,1954,,,,
4,North Beach,North Seattle,1940,,,,


In [29]:
seattle_df = seattle_df.drop([0], axis = 0)
seattle_df.head()

,Neighborhood name,Within larger district,Annexed[41],Locator map,Street map,Image,Notes
1,North Seattle,Seattle,Various,,,,North of the
2,Broadview,North Seattle,1954,,,,
3,Bitter Lake,North Seattle,1954,,,,
4,North Beach,North Seattle,1940,,,,
5,Crown Hill,North Seattle,1907,,,,


In [30]:
seattle_df = seattle_df.drop(['Annexed[41]','Locator map', 'Street map', 'Image', 'Notes'], axis = 1)
seattle_df = seattle_df.set_index('Within larger district')
seattle_df.head()

Within larger district,Neighborhood name
Seattle,North Seattle
North Seattle,Broadview
North Seattle,Bitter Lake
North Seattle,North Beach
North Seattle,Crown Hill


In [31]:
seattle_df.to_csv('seattle_data.csv') # store the data locally as a CSV

### SCRAPING ATLANTA

In [89]:
atlanta_tables = atlanta_soup.select('tbody')[0]
atlanta_neighborhoods = atlanta_tables.find('tr').find('td').find_all('p')

In [91]:
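atlanta_breaks = []
for p in atlanta_neighborhoods:
    # Each neighborhood name is the text that precedes a <br/> tag inside the paragraph.
    for name in re.findall(r'(.*)<br/>', str(p)):
        # str.strip('<p>') strips the characters '<', 'p', and '>' rather than the tag itself,
        # so remove a leading <p> or trailing </p> tag explicitly.
        name = re.sub(r'^\s*<p[^>]*>|</p>\s*$', '', name).strip()
        atlanta_breaks.append(name)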
print(atlanta_breaks)
len(atlanta_breaks)

['Chastain Park', 'Kingswood', 'Margaret Mitchell', 'Mt. Paran Parkway', 'Mt. Paran/Northside', 'Paces', 'Pleasant Hill', 'Randall Mill', 'Tuxedo Park', 'West Paces Ferry/Northside', 'Brookhaven', 'Buckhead Forest', 'Buckhead Village', 'East Chastain Park', 'Garden Hills', 'Lenox', 'Lindbergh/Morosgo', 'North Buckhead', 'Peachtree Heights East', 'Peachtree Heights West', 'Peachtree Hills', 'Peachtree Park', 'Pine Hills', 'Ridgedale Park', 'Berkeley Park', 'Blandtown', 'Bolton', 'Hills Park', 'Riverside', 'Underwood Hills', 'Almond Park', 'Atlanta Industrial Park', 'Bolton Hills', 'Brookview Heights', 'Carey Park', 'Carver Hills', 'Chattahoochee', 'English Park', 'Lincoln Homes', 'Monroe Heights', 'Rockdale', 'Scotts Crossing', 'Adamsville', 'Baker Hills', 'Bakers Ferry', 'Bankhead Courts', 'Bankhead/Bolton', 'Boulder Park', 'Carroll Heights', 'Fairburn Heights', 'Fairburn Road/Wisteria Lane', 'Fairburn Mays', 'Mays', 'Oakcliff', 'Old Gordon', 'Ridgecrest Forest', 'Wildwood', 'Wilson Mi

178
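
As an alternative to matching `<br/>` with a regular expression, Beautiful Soup can return the text pieces of a paragraph directly. The sketch below uses a short inline snippet rather than the live page and also shows why `str.strip('<p>')` is risky: it strips a set of characters, not a tag. Note that this approach also keeps a final entry that is not followed by `<br/>`, so its count may differ from the 178 names found above.

```python
from bs4 import BeautifulSoup

# str.strip removes any of the given characters from both ends, not the substring "<p>".
damaged = "<p>pippa</p>".strip("<p>")
print(repr(damaged))

# Inline stand-in for one paragraph of the neighborhoods page.
snippet = "<p>Chastain Park<br/>Kingswood<br/>Paces</p>"
soup = BeautifulSoup(snippet, "html.parser")

# stripped_strings yields each piece of text between tags, with whitespace trimmed.
names = [text for p in soup.find_all("p") for text in p.stripped_strings]
print(names)
```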

In [31]:
# atlanta_breaks comes from the scraping cell above, which must be run first.
atlanta_df = pd.DataFrame(atlanta_breaks, columns=['Neighborhoods'])
atlanta_df.head()

NameError: name 'atlanta_breaks' is not defined

In [93]:
atlanta_df.to_csv('new_atlanta.csv')

### SCRAPING DALLAS, TX

In [8]:
dallas_tables = dallas_soup.select("body")[0]
headers = list()
dallas_headers = dallas_tables.find_all("h3", class_ = "boundary_group_header")
for header in dallas_headers:
    header = str(header.get_text()).strip()
    headers.append(header)
print(headers)
len(headers)

['Plano & North Dallas', 'Dallas', 'East & South Dallas', 'Fort Worth & West', 'Wylie', 'Watagua', 'Garland', 'Arlington', 'Fort Worth', 'Other neighborhoods close to Dallas']


10

In [9]:
dallas_rows = dallas_tables.find_all('ul', class_='unstyled clearfix')
dallas_hoods_list = []
dallas_areas_list = []

# The i-th list on the page is assumed to belong to the i-th area header.
for count, ul in enumerate(dallas_rows):
    for anchor in ul.find_all('a'):
        dallas_areas_list.append(headers[count])
        dallas_hoods_list.append(anchor.get_text().strip())
print(len(dallas_areas_list))
len(dallas_hoods_list)

219


219
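
Matching lists to headers by position works only while both appear in the same order and number. A possibly more robust variant, sketched below on a made-up inline page, looks up the list that follows each header. It assumes each `<ul>` is a sibling of its `<h3>`, which has not been verified against the real Dallas page; the second area's neighborhood name is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment reusing the class names from the cell above.
html = """
<div>
  <h3 class="boundary_group_header">Plano &amp; North Dallas</h3>
  <ul class="unstyled clearfix"><li><a>Addison</a></li><li><a>Argyle</a></li></ul>
  <h3 class="boundary_group_header">Garland</h3>
  <ul class="unstyled clearfix"><li><a>Example Hood</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for h3 in soup.find_all("h3", class_="boundary_group_header"):
    ul = h3.find_next_sibling("ul")  # the first <ul> that follows this header
    if ul is None:
        continue  # header without a list; nothing to record
    area = h3.get_text(strip=True)
    for anchor in ul.find_all("a"):
        pairs.append((area, anchor.get_text(strip=True)))

print(pairs)
```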

In [12]:
dallas_df= pd.DataFrame({'Area': dallas_areas_list , 'Neighborhood': dallas_hoods_list})
dallas_df = dallas_df.set_index('Area')
dallas_df

Area,Neighborhood
Plano & North Dallas,Addison
Plano & North Dallas,Argyle
Plano & North Dallas,Bartonville
Plano & North Dallas,Celeste
Plano & North Dallas,Coppell
Plano & North Dallas,Corinth
Plano & North Dallas,Double Oak
Plano & North Dallas,Farmers Branch
Plano & North Dallas,Grapevine
Plano & North Dallas,Greenville


In [13]:
dallas_df.to_csv('dallas_data.csv')

### SCRAPING KANSAS CITY, MO

In [14]:
KC_tables = KC_soup.select("body")[0]
headers = list()
KC_headers = KC_tables.find_all("h3")
for hthree in KC_headers:
    KC_areas = hthree.find_all('span', class_= 'mw-headline')
    for area in KC_areas:
        header = str(area.get_text()).strip()
        headers.append(header)
print(headers)
len(headers)

['CBD-Downtown', 'Greater Downtown', 'East Side', 'Midtown-Westport', 'Northeast', 'Northland', 'Plaza area', 'South Kansas City']


8

In [15]:
KC_rows = KC_tables.find('div', class_='mw-parser-output').find_all('ul')
KC_hoods_list = []
KC_areas_list = []
# Skip the first two <ul> elements; the next eight are matched, in order, to the eight area headers.
for area_index, ul in enumerate(KC_rows[2:10]):
    for li in ul.find_all('li'):
        KC_areas_list.append(headers[area_index])
        KC_hoods_list.append(li.get_text().strip())
print(KC_hoods_list)
print(len(KC_areas_list))
print(len(KC_hoods_list))

['CBD-Downtown', '18th and Vine', 'Beacon Hill-McFeders', 'Columbus Park', 'Crossroads', 'Hospital Hill', 'Library District', 'Longfellow/Dutch Hill', 'Quality Hill', 'River Market', 'Union Hill', 'West Side', 'Ashland Ridge', 'Blue Hills', 'Blue Valley', 'Boulevard Village', 'Brown Estates', 'Country Valley-Hawthorne Square', 'Cunningham Ridge', 'Dunbar', 'East 23rd Street P.A.C.', 'Eastwood Hills', 'Glen Lake', 'Glen Oaks', 'Ingleside', 'Ivanhoe', 'Key Coalition', 'Knoches Park', 'Leeds', 'Mount Cleveland', 'Oak Park', 'Palestine', 'Parkview', 'Riss Lake', 'Santa Fe', 'Sheraton Estates', 'Stayton Meadows', 'Sterling Acres', 'Sterling Gardens', 'Vineyard', 'Vineyard Estates', 'Washington-Wheatley', 'Wendell Phillips', 'Center City', 'Coleman Highlands', 'Hanover Place', 'Hyde Park', 'Manheim Park', 'Mount Hope', 'Old Hyde Park Historic District, Inc.', 'Old Westport', 'Plaza Westport', 'Roanoke', 'Southmoreland', 'Squier Park', 'Valentine', 'Volker', 'Westport', 'Forgotten Homes', 'In

In [16]:
KC_df = pd.DataFrame({'Area': KC_areas_list, 'Neighborhood': KC_hoods_list})
print(len(KC_df['Area']))
KC_df = KC_df.set_index('Area')
KC_df.head()

232


Area,Neighborhood
CBD-Downtown,CBD-Downtown
Greater Downtown,18th and Vine
Greater Downtown,Beacon Hill-McFeders
Greater Downtown,Columbus Park
Greater Downtown,Crossroads


In [17]:
KC_df.to_csv('kansas_city_data.csv')
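
When the saved CSV files are read back later, the index column has to be restored explicitly. The sketch below round-trips a two-row sample through an in-memory buffer instead of a file; the `City` column added at the end is an illustrative assumption about how the per-city tables might be combined.

```python
import io
import pandas as pd

# Two-row sample in the same shape as KC_df.
kc_sample = pd.DataFrame(
    {
        "Area": ["Greater Downtown", "Greater Downtown"],
        "Neighborhood": ["Crossroads", "River Market"],
    }
).set_index("Area")

# Write to an in-memory buffer, then read it back as if it were the CSV file.
buffer = io.StringIO()
kc_sample.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Area")

# Tag the rows with their city so several city tables can be concatenated later.
tagged = restored.reset_index().assign(City="Kansas City, MO")
print(tagged)
```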