## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## INTRODUCTION <a name="introduction"></a>

The tech industry is growing rapidly, and new jobs appear every day. As a result, Data Scientists are increasingly likely to receive job offers in different cities.

Beyond the remuneration and other benefits attached to a job offer, employees may also care about how the social life of a new location compares with that of their current one. For example, a Data Scientist working in Toronto who receives offers in Dallas and New York may want to know how multicultural those cities are compared to Toronto.

In this project, we apply machine learning models to analyze the similarities and dissimilarities between the top tech cities for IT jobs in 2020. We then present a recommender system that suggests the best location to work, based on the social life of the city where a Data Scientist currently lives.

The same approach could be extended to start-ups looking to relocate or open new branches, and to employees looking for a vacation destination.

## DATA COLLECTION <a name="data"></a>

* First, use **BeautifulSoup** to scrape the list of top tech cities for IT jobs in 2020 from: https://dailyhive.com/toronto/toronto-tech-talent-top-ranking-north-america-report
* Next, use **Google Maps API geocoding** to obtain the latitude and longitude of each city
* Finally, obtain data on the venues around each city using the **Foursquare API**

In [5]:
!pip install folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [7]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [8]:
#Import modules
from bs4 import BeautifulSoup
import pandas as pd
import folium
import requests
import geocoder
import numpy as np

In [9]:
#run a get request to get details from the page
URL = "https://dailyhive.com/toronto/toronto-tech-talent-top-ranking-north-america-report"
page = requests.get(URL)

In [10]:
#prettify the data in html format
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id="article-115047")
print(results.prettify())

<div class="wp-content" id="article-115047">
 <p>
  It’s no secret that Ontario’s capital city is a leader in technology, but once again the city has been recognized as one of the top hubs for tech on the continent, beating out the likes of New York, Boston, Vancouver, and more.
 </p>
 <p>
  CBRE has released its Tech Talent report for 2020 and revealed that Toronto is now North America’s fourth top market for tech talent — falling
  <a href="https://dailyhive.com/toronto/toronto-tech-talent-cbre-ranking-2019" target="_blank">
   one place compared to last year
  </a>
  . According to the report, the city was a North American leader in tech employment growth “leading up to the COVID-19 pandemic, helping to fortify the city’s status as one of the world’s leading tech centre.”
 </p>
 <p style="font-weight: 400;">
  Toronto was also able to add 66,900 tech jobs over the past five years, the second-most of any North American city in the past five years, beat out only by the San Francisco’s

In [11]:
# Get the text of every list item (<li>) in the article; these include the top cities
tab_data = [cell.text for cell in results.find_all(["li"])]
df = pd.DataFrame(tab_data,dtype='string')
df.dtypes

0    string
dtype: object

In [12]:
# Drop the first four list items, which come before the city ranking
df = df.loc[4:]
df.reset_index(inplace=True, drop=True)
df

|   | 0 |
|---|---|
| 0 | San Francisco Bay Area, CA |
| 1 | Washington, D.C. |
| 2 | Seattle, WA |
| 3 | Toronto, ON |
| 4 | New York, NY |
| 5 | Austin, TX |
| 6 | Denver, CO |
| 7 | Boston, MA |
| 8 | Atlanta, GA |
| 9 | Raleigh-Durham, NC |


In [13]:
# Split each entry into city and state
df.columns = ['City']
new = df['City'].str.split(", ", expand=True)
df['City'] = new[0]
df['State'] = new[1]
df.head()

|   | City | State |
|---|---|---|
| 0 | San Francisco Bay Area | CA |
| 1 | Washington | D.C. |
| 2 | Seattle | WA |
| 3 | Toronto | ON |
| 4 | New York | NY |
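
The split above works because every entry has the form `City, State`. As a minimal sketch in plain Python, assuming some entries might contain extra commas or no state at all, the same step could be written defensively. The helper name `split_city_state` is hypothetical and not part of the notebook.

```python
def split_city_state(entry):
    """Split 'City, State' on the last ', ' separator.

    Returns (city, state); state is None when no separator is present.
    """
    city, sep, state = entry.rpartition(", ")
    if not sep:
        # No ', ' found: rpartition puts the whole string in the last slot.
        return entry.strip(), None
    return city.strip(), state.strip()


for raw in ["San Francisco Bay Area, CA", "Washington, D.C.", "Toronto"]:
    print(split_city_state(raw))
```

Splitting on the last separator keeps a multi-part city name intact, and returning `None` for a missing state makes incomplete entries easy to spot before geocoding.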


In [14]:
# The code was removed by Watson Studio for sharing.

In [15]:
# Use the geocoder package to obtain latitude and longitude data for each city and add this to the dataframe, df
Latitude=[]
Longitude=[]
for index, row in df.iterrows():
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        city = row['City']
        state = row['State']
        g = geocoder.google(city +','+ state,key=APIKey)
        lat_lng_coords = g.latlng
    Latitude.append(lat_lng_coords[0])
    Longitude.append(lat_lng_coords[1])
    

df['Latitude']=Latitude
df['Longitude'] =Longitude

df.head()

|   | City | State | Latitude | Longitude |
|---|---|---|---|---|
| 0 | San Francisco Bay Area | CA | 37.827178 | -122.291308 |
| 1 | Washington | D.C. | 38.907192 | -77.036871 |
| 2 | Seattle | WA | 47.606209 | -122.332071 |
| 3 | Toronto | ON | 43.653226 | -79.383184 |
| 4 | New York | NY | 40.712775 | -74.005973 |
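
The geocoding loop above keeps retrying until coordinates come back, so it never terminates if a lookup keeps failing (for example, with an invalid key or an unknown place name). A minimal sketch of a bounded retry is shown here. The helper `geocode_with_retry`, its parameters, and the default of five attempts are assumptions for illustration, not part of the notebook. The `lookup` argument can be any callable that returns a `[lat, lng]` pair or `None`.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup: callable returning a [lat, lng] pair, or None on failure.
    Raises RuntimeError if every attempt returns None.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    raise RuntimeError(
        "No coordinates for {!r} after {} attempts".format(query, max_attempts)
    )
```

In the notebook, the lookup could be wrapped as `lambda q: geocoder.google(q, key=APIKey).latlng`, so a city that cannot be geocoded raises a clear error instead of hanging the cell.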


In [17]:
# Get the latitude and longitude of North America and use them to center a folium map
t = geocoder.google('North America', key=APIKey)
lat_lng_coords = t.latlng
na_lat = lat_lng_coords[0]
na_long = lat_lng_coords[1]
print('The geographical coordinates of North America are {}, {}.'.format(na_lat, na_long))

The geographical coordinates of North America are 54.5259614, -105.2551187.


In [18]:
#Map of North America showing the location of the top cities
map_na = folium.Map(location=[na_lat, na_long], zoom_start=2)

# add markers to map
for lat, lng, city, state in zip(df['Latitude'], df['Longitude'], df['City'], df['State']):
    label = '{}, {}'.format(city, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_na)  
    
map_na

## Get venue data for each city using the **Foursquare API**

In [19]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: [redacted]
CLIENT_SECRET: [redacted]


In [20]:
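# First, define a function that gets all the venues within a given radius of each city.
# CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are assumed to be defined in the hidden cell above.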
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
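
The function above builds the request URL with string formatting and indexes straight into the JSON response, which raises an error if a city returns no groups or a venue has no category. A minimal sketch of the same two steps with the standard library is shown here. The helper names `build_explore_url` and `extract_venue_rows` are hypothetical, and the response shape is assumed to be the one the function above already relies on.

```python
from urllib.parse import urlencode

# Same endpoint as in the function above.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL, letting urlencode escape the query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venue_rows(city, lat, lng, payload):
    """Flatten one decoded JSON response into row tuples.

    Returns an empty list when the response has no groups, and uses None
    as the category when a venue has no categories.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            city, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows
```

Separating URL construction from response parsing also makes both steps testable without calling the API.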

In [21]:
na_venues = getNearbyVenues(names=df['City'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

San Francisco Bay Area
Washington
Seattle
Toronto
New York
Austin
Denver
Boston
Atlanta
Raleigh-Durham
Baltimore
Vancouver
Dallas/Ft. Worth
Ottawa
Salt Lake City
Montreal
Minneapolis
Phoenix
San Diego
Portland
Orange Country
Philadelphia
Chicago
Columbus
Newark
Los Angeles
Madison
Charlotte
Tampa
Pittsburgh


In [22]:
na_venues.shape

(2649, 7)

In [23]:
# 2649 venues were returned across all cities; let us see how many venues each city has
na_venues.groupby('City').count()

| City | City Latitude | City Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Atlanta | 82 | 82 | 82 | 82 | 82 | 82 |
| Austin | 100 | 100 | 100 | 100 | 100 | 100 |
| Baltimore | 100 | 100 | 100 | 100 | 100 | 100 |
| Boston | 100 | 100 | 100 | 100 | 100 | 100 |
| Charlotte | 100 | 100 | 100 | 100 | 100 | 100 |
| Chicago | 100 | 100 | 100 | 100 | 100 | 100 |
| Columbus | 100 | 100 | 100 | 100 | 100 | 100 |
| Dallas/Ft. Worth | 1 | 1 | 1 | 1 | 1 | 1 |
| Denver | 100 | 100 | 100 | 100 | 100 | 100 |
| Los Angeles | 100 | 100 | 100 | 100 | 100 | 100 |
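
The counts above are uneven: most cities reach 100 venues, while Dallas/Ft. Worth returned only one. A city with so few venues gives little basis for comparison, so it may be worth flagging such cases before any further analysis. A minimal sketch follows. The helper `sparse_cities` and the threshold of 10 are assumptions for illustration, and the counts are copied from the rows shown above.

```python
def sparse_cities(venue_counts, minimum=10):
    """Return the cities with fewer than `minimum` venues, sorted by name."""
    return sorted(city for city, n in venue_counts.items() if n < minimum)


# Venue counts copied from the rows displayed above.
venue_counts = {
    "Atlanta": 82, "Austin": 100, "Baltimore": 100, "Boston": 100,
    "Charlotte": 100, "Chicago": 100, "Columbus": 100,
    "Dallas/Ft. Worth": 1, "Denver": 100, "Los Angeles": 100,
}

print(sparse_cities(venue_counts))  # ['Dallas/Ft. Worth']
```

For a flagged city, one option would be to check its geocoded coordinates or retry the venue query with a larger radius.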


In [24]:
print('There are {} unique categories.'.format(len(na_venues['Venue Category'].unique())))

There are 313 unique categories.


### There are 313 unique venue categories. This data will be explored further to build clusters of cities and a recommender system for relocation.
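
One way venue categories could feed a city comparison is to turn each city's categories into a frequency profile and measure how closely two profiles match. The sketch below uses cosine similarity for that purpose. It is only an illustration under stated assumptions: the notebook does not say which similarity measure or clustering model it uses, the function names are hypothetical, and the sample categories are made up.

```python
from collections import Counter
from math import sqrt


def category_profile(categories):
    """Map each venue category to its share of the city's venues."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(value * q.get(category, 0.0) for category, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_p * norm_q)


# Made-up sample data, for illustration only.
city_a = category_profile(["Coffee Shop", "Park", "Coffee Shop", "Museum"])
city_b = category_profile(["Coffee Shop", "Park", "Bar", "Bar"])
print(round(cosine_similarity(city_a, city_b), 3))
```

A score near 1 means two cities have a similar mix of venue types, and a score of 0 means they share no categories. Ranking candidate cities by their score against the current city would be one simple form of recommendation.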