# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

<h1 align=center><font size = 4>Segmenting and Clustering Neighborhoods in Bangalore City For New Restaurants</font></h1>

 <h2 align=center><font size = 4>Introduction</font></h2>
This project is the final assignment for the IBM Data Science Capstone on Coursera. The scenario is a person who wants to open a new restaurant near an IT office in Bangalore. Because opening a restaurant in these areas is costly, it is important to analyze the locations and the existing restaurants around the offices to understand the competition. Machine learning can help identify promising areas and can also be used to analyze the cuisines on offer. The project gathers location information for the IT offices and uses the Foursquare API to get details of nearby venues. The offices are then grouped into clusters by their nearby venues to determine the best options.

### Importing the Python libraries

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import in pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


 <h2 align=center><font size = 4>Data</font></h2>

In this assignment we first scrape a web page listing IT companies in Bangalore and clean the resulting data. We then convert the addresses into latitude and longitude values and use the Foursquare API to explore the neighborhood around each office. From the venues we get the most common venue categories in each neighborhood and use them as features to group the neighborhoods into clusters with the *k*-means algorithm. Finally, we use the Folium library to visualize the neighborhoods in Bangalore and their emerging clusters.

Based on the definition of our problem, the factors that will influence our decision are:
* the number of existing restaurants in the neighborhood (any type of restaurant)
* the number of offices present in a particular neighborhood
* the preferable nearby places (most valued)

The following data sources are needed to extract or generate the required information:
* The office data is pulled from https://www.naukri2000.com/careers/it_bangalore.php, which lists 300 IT companies along with their addresses. The addresses also contain the PIN code of each office.
* The Foursquare API is used to get the details of the nearby places so that the data can be explored graphically.

## 1. Scrape data from the web page into a DataFrame

Get the data set from the URL https://www.naukri2000.com/careers/it_bangalore.php and then parse it using the BeautifulSoup package.

In [5]:
dataset = requests.get('https://www.naukri2000.com/careers/it_bangalore.php').text

In [6]:
from bs4 import BeautifulSoup # library to parse HTML and XML documents

In [7]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(dataset, 'html.parser')

### Creating dataframe with the Company_Name and Company_Address

In [8]:
# create two lists to store table data
company_name = []
company_address = []
for row in soup.find_all('table'):
    cells = row.find_all('tr')
for companies in cells:
    company_name.append(companies.contents[2].text)
    company_address.append(companies.contents[4].text)
bangalore_df = pd.DataFrame({"Company_Name": company_name,
                           "Company_Address": company_address})


In [9]:
bangalore_df.head(10)

Unnamed: 0,Company_Name,Company_Address
0,Name,Address
1,24/7 Customer Pvt Ltd,"Survey No 2/1, 2/2, 2/3, & 5/1, Challaghatta V..."
2,247 Learning Solutions Pvt Ltd,"No 20, Annaswamy Mudaliar Road, Ulsoor Lake, \..."
3,Accenture Services Pvt Ltd,71 Cunningham Road\rBangalore - 560 052
4,Accord Software & Systems Pvt Ltd,"# 37, K.R. Colony, Domlur Layout, \r\nBangalor..."
5,Acme Insurance Services Pvt Ltd,"3rd Floor, Monarch Chambers, \r122, Infantry R..."
6,Adaptec (India) Pvt Ltd,"No:5 , First Floor, \r\nSalarpuria Infinity\r\..."
7,Adea International Pvt Ltd,"No.319/1, Bommanahalli\r\nHosur Main Road\r\nB..."
8,Aditi Technologies Pvt Ltd,224/16 Ramana Maharishi Rd\r\nBangalore 560 080
9,Affiliated Computer Services of India (P) Ltd ...,"Level 2, Creator Block International Tech Park..."


### Extracting the PIN code from address 

In [10]:
# extract the PIN code digits that follow 'Bangalore' in each address
import re
company_code = []
for addr in bangalore_df['Company_Address']:
    nums = addr.split('Bangalore')[-1]  # text after the last 'Bangalore'
    numbers = re.findall(r'\d+', nums)  # digit groups, e.g. ['560', '052']
    company_code.append("".join(numbers))

### Appending the PIN Code in the Original dataframe

In [11]:
bangalore_df['Comapany_Code'] = company_code

### Cleaning Data to remove header

In [12]:
bangalore_df = bangalore_df.loc[1:]

In [13]:
bangalore_df.head()

Unnamed: 0,Company_Name,Company_Address,Comapany_Code
1,24/7 Customer Pvt Ltd,"Survey No 2/1, 2/2, 2/3, & 5/1, Challaghatta V...",
2,247 Learning Solutions Pvt Ltd,"No 20, Annaswamy Mudaliar Road, Ulsoor Lake, \...",560042.0
3,Accenture Services Pvt Ltd,71 Cunningham Road\rBangalore - 560 052,560052.0
4,Accord Software & Systems Pvt Ltd,"# 37, K.R. Colony, Domlur Layout, \r\nBangalor...",560071.0
5,Acme Insurance Services Pvt Ltd,"3rd Floor, Monarch Chambers, \r122, Infantry R...",560001.0


### Correcting erroneous PIN codes by keeping only the last six digits

In [14]:
bangalore_df['Comapany_Code'] = bangalore_df['Comapany_Code'].str[-6:]
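
The two steps above (joining the digits found after 'Bangalore', then keeping the last six characters) could also be expressed as a single pattern match. The following is a minimal sketch rather than the notebook's method: `extract_pin` and `PIN_PATTERN` are hypothetical names, and it assumes a PIN code is written as six digits with an optional space in the middle, as in `560 052`.

```python
import re

# Assumption: a PIN code is six digits, optionally split 3+3 by one whitespace.
PIN_PATTERN = re.compile(r'\b(\d{3})\s?(\d{3})\b')

def extract_pin(address):
    """Return the last six-digit PIN found in the address, or '' if none."""
    matches = PIN_PATTERN.findall(address)
    if not matches:
        return ""
    # Take the last match, since the PIN normally closes the address.
    return "".join(matches[-1])

# Addresses taken from the dataframe preview shown earlier.
print(extract_pin("71 Cunningham Road\rBangalore - 560 052"))
print(extract_pin("224/16 Ramana Maharishi Rd\r\nBangalore 560 080"))
```

Because the pattern requires exactly six digits, house numbers such as `224/16` are ignored and no separate truncation step is needed.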

### Getting the corresponding Latitude and Longitude details of all the office address using the 'geopy' package 

In [15]:
from geopy.geocoders import Nominatim
# create the geocoder once; recent geopy versions require an explicit user_agent
geolocator = Nominatim(user_agent="my-application")
lat = []
long = []
for address in bangalore_df['Comapany_Code']:
    if address:
        location = geolocator.geocode(address)
        if location:
            lat.append(location.latitude)
            long.append(location.longitude)
        else:
            # no match for this PIN code
            lat.append("")
            long.append("")
    else:
        # no PIN code could be extracted for this office
        lat.append("")
        long.append("")

### Appending the Geographical Coordinates to the Original dataframe

In [16]:
bangalore_df['latitude'] = lat
bangalore_df['longitude'] = long

In [17]:
bangalore_df.loc[3:]

Unnamed: 0,Company_Name,Company_Address,Comapany_Code,latitude,longitude
3,Accenture Services Pvt Ltd,71 Cunningham Road\rBangalore - 560 052,560052.0,12.9902,77.596
4,Accord Software & Systems Pvt Ltd,"# 37, K.R. Colony, Domlur Layout, \r\nBangalor...",560071.0,12.9576,77.6404
5,Acme Insurance Services Pvt Ltd,"3rd Floor, Monarch Chambers, \r122, Infantry R...",560001.0,-33.0381,137.576
6,Adaptec (India) Pvt Ltd,"No:5 , First Floor, \r\nSalarpuria Infinity\r\...",560029.0,12.9262,77.5974
7,Adea International Pvt Ltd,"No.319/1, Bommanahalli\r\nHosur Main Road\r\nB...",560068.0,12.9003,77.6198
8,Aditi Technologies Pvt Ltd,224/16 Ramana Maharishi Rd\r\nBangalore 560 080,560080.0,13.0001,77.5833
9,Affiliated Computer Services of India (P) Ltd ...,"Level 2, Creator Block International Tech Park...",560066.0,12.9536,77.7158
10,Ajax.com Pvt Ltd,"#1, 3rd Floor Maruthi Complex, \r\nAbove Food ...",560032.0,13.0253,77.5984
11,Akamai Technologies India Pvt Ltd,"Salarpuria Ascent\r\n#77, Jyothi Nivas College...",560095.0,12.9375,77.6179
12,Altair Engineering India Pvt Ltd,"Mercury 2B Block, 5th Floor,\r\nPrestige Tech ...",560078.0,12.9005,77.5704
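
In the output above, PIN code 560001 is placed at latitude -33.0381, longitude 137.576, while the other rows fall near 12.9, 77.6. A bare six-digit query is ambiguous to a geocoder. One possible mitigation, sketched here under assumptions, is to append the city and country to the query, cache repeated PIN codes, and pause between requests (the public Nominatim service asks for at most one request per second). `geocode_pins` is a hypothetical helper, not part of geopy; it accepts any callable shaped like `geolocator.geocode`. The `country_codes` argument in the commented usage is assumed to be supported by the installed geopy version.

```python
import time

def geocode_pins(pins, geocode, pause=1.0):
    """Geocode PIN codes with a city/country hint.

    `geocode` is any callable that takes a query string and returns an
    object with `.latitude` and `.longitude`, or None when nothing matches.
    Returns one (lat, lng) tuple per input PIN; (None, None) when unresolved.
    """
    cache = {}
    coords = []
    for pin in pins:
        if not pin:
            coords.append((None, None))
            continue
        if pin not in cache:
            # Assumption: adding city and country disambiguates the PIN code.
            location = geocode(f"{pin}, Bangalore, India")
            if location:
                cache[pin] = (location.latitude, location.longitude)
            else:
                cache[pin] = (None, None)
            time.sleep(pause)  # stay within the service's rate limit
        coords.append(cache[pin])
    return coords

# Possible usage (needs network access and geopy):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="my-application")
# coords = geocode_pins(
#     bangalore_df['Comapany_Code'],
#     lambda q: geolocator.geocode(q, country_codes="in"),
# )
```

Caching matters here because many offices share a PIN code, so each distinct code is looked up only once.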


Now that we have all the required data, including the geographical coordinates of the locations, we can use the **Foursquare API** to obtain the nearby venue details. First we create the request URL with all the required parameters. We then use it with the coordinates of each neighborhood to get the venue names and store the output in a dataframe.

In [18]:
address = 'Bangalore'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Bangalore are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Bangalore are 12.9791198, 77.5912997.


### Let's create a map of Bangalore with all the location details using the folium package

In [19]:
# create map of Bangalore using latitude and longitude values
map_bangalore = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, name in zip(bangalore_df['latitude'], bangalore_df['longitude'], bangalore_df['Company_Name']):
    if lat and lng:  # skip offices that could not be geocoded
        label = folium.Popup(name, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_bangalore)  
    
map_bangalore

Set the CLIENT_ID and CLIENT_SECRET from a Foursquare developer account; they are used to build the request URL.

In [26]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20191225'  # Foursquare API version
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


In [27]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=&client_secret=&v=20191225&ll=12.9791198,77.5912997&radius=500&limit=100'
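
The URL above is assembled with string formatting. As a hedged alternative, the query string can be built from a dictionary with the standard library, which takes care of escaping and avoids the stray `&` after `?`. `build_explore_url` is a hypothetical helper name; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the venues/explore request URL for one coordinate pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter readable.
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

print(build_explore_url('', '', '20191225', 12.9791198, 77.5912997))
```

With `requests`, the same dictionary could instead be passed through the `params` argument of `requests.get`, which leaves the URL assembly to the library.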

Use the **Foursquare API** to get the **venue list** for every location, then create a dataframe from the list.

In [22]:
venues_list=[]
for name, lat, lng in zip(bangalore_df['Company_Name'],bangalore_df['latitude'],bangalore_df['longitude']):
    print(name)
    if lat and lng:  # skip offices without coordinates
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT)
                
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

24/7 Customer Pvt Ltd 
247 Learning Solutions Pvt Ltd 
Accenture Services Pvt Ltd 
Accord Software & Systems Pvt Ltd 
Acme Insurance Services Pvt Ltd 
Adaptec (India) Pvt Ltd 
Adea International Pvt Ltd 
Aditi Technologies Pvt Ltd 
Affiliated Computer Services of India (P) Ltd (ACS) 
Ajax.com Pvt Ltd 
Akamai Technologies India Pvt Ltd 
Altair Engineering India Pvt Ltd 
Antares Systems Ltd 
ANZ Information Technology Pvt. Ltd. 
Arowana Consulting Ltd 
Artech Infosystems Pvt Ltd 
Ascendas Property Management Services (India) Pvt Ltd 
ASM Technologies Ltd 
Aspect Technology Center (India) Pvt Ltd 
Avaya GlobalConnect Ltd 
Axes Technologies (I) Pvt Ltd 
Aztecsoft Ltd 
BAeHAL Software Limited 
Bahwan CyberTek Pvt Ltd 
Bangalore Softsell Ltd 
Bells Softech Limited 
BEML Technology Division (A div of Bharat Earth Movers Ltd) 
Bharti Telesoft Ltd. 
Blue Chip Computer Consultants Pvt Ltd 
Blue Star Infotech Ltd 
Borland India Pvt Ltd 
Business Process Outsourcing (India) Pvt Ltd 
C1 India Pvt L

Thirdware Solution Ltd. 
Tholons Knowledge Management Pvt Ltd 
Thomson Corporation (International) Pvt Ltd 
ThoughtWorks Technologies India Pvt Ltd 
TIBCO Software India Pvt Ltd 
Timken Engineering and Research India Pvt Ltd 
TPI Advisory Services India Pvt Ltd 
TQM International Pvt Ltd 
Transfleet Global Services Pvt Ltd (TESCO India) 
Trianz Consulting Pvt Ltd 
Tricon Infotech Pvt. Ltd 
Trigent Software Ltd. 
Trilogy E-business Software India Ltd 
Trimentus Technologies Pvt Ltd 
TriVium iCOPE Technologies Pvt Ltd 
TRRS Imaging Ltd 
TRRS Imaging Ltd 
TTK Healthcare Services Pvt Ltd 
U&I Scotty Computers Ltd 
UBICS Technologies Pvt Ltd 
UL India Pvt Ltd 
Unisys Global Services - India (STP Division of Unisys India Pvt Ltd) 
Universal Legal 
USi Internetworking Services Pvt Ltd 
UTL Technologies Ltd. 
Utopia India Pvt Ltd 
Valtech India Technology Solutions Pvt Ltd 
Vee Technologies Pvt Ltd 
Vinciti Networks Pvt Ltd 
Vinpack India pvt. Ltd 
Viteos Capital Market Services Ltd 
vMoksha T

In [23]:
bangalore_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
bangalore_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
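
The request loop above indexes directly into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a failed request or a venue without a category would raise an exception partway through the run. The following is a minimal sketch of a more defensive parser, assuming the response has the same shape as the one used above. `extract_venues` is a hypothetical helper name.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response into rows matching the dataframe columns.

    Missing keys yield None values (or an empty list) instead of raising.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0].get("name") if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows

# Possible usage inside the loop (network call left commented out):
# payload = requests.get(url).json()
# venues_list.append(extract_venues(name, lat, lng, payload))
```

Each returned tuple has the same seven fields as the columns assigned to `bangalore_venues`, so the rest of the notebook would not need to change.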

#### Let's check that we got the venue details for all the neighborhoods

In [24]:
bangalore_venues.head(20)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Westside,12.98257,77.610713,Shopping Mall
1,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Mysore Saree Udyog,12.981433,77.610214,Women's Store
2,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Chaipatty ulsoor,12.976061,77.615338,Tea Room
3,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Vashi's House of Jeans,12.981449,77.610308,Clothing Store
4,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Bobby's Punjabi Dhaba,12.983826,77.613955,Indian Restaurant
5,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Cafe Coffee Day Ulsoor Lake,12.979498,77.618151,Café
6,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Sreeraj Lassi Bar,12.982748,77.610739,Juice Bar
7,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Anand Sweets,12.981667,77.60979,Candy Store
8,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Baskin-Robbins,12.975867,77.614559,Ice Cream Shop
9,247 Learning Solutions Pvt Ltd,12.980207,77.614153,Sri Krishna Diamonds and Jewellery,12.981445,77.610212,Jewelry Store


In [25]:
print(bangalore_venues.shape)

(6922, 7)
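
With the venue table in place, two of the quantities named in the Data section can be derived from it: the number of existing restaurants around each office and the most common venue categories per neighborhood. The following is a rough sketch using only the standard library. `summarize_venues` is a hypothetical helper, and treating every category whose name contains "Restaurant" as a restaurant is an assumption for illustration.

```python
from collections import Counter, defaultdict

def summarize_venues(rows, top_n=3):
    """Summarize (neighborhood, category) pairs per neighborhood.

    Returns {neighborhood: {"restaurants": int, "top_categories": [str, ...]}}.
    """
    by_neighborhood = defaultdict(Counter)
    for neighborhood, category in rows:
        by_neighborhood[neighborhood][category] += 1

    summary = {}
    for neighborhood, counts in by_neighborhood.items():
        # Assumption: restaurant categories carry "Restaurant" in their name.
        restaurants = sum(n for cat, n in counts.items() if "Restaurant" in cat)
        top = [cat for cat, _ in counts.most_common(top_n)]
        summary[neighborhood] = {"restaurants": restaurants,
                                 "top_categories": top}
    return summary

# A few rows taken from the venue table shown above.
sample_rows = [
    ("247 Learning Solutions Pvt Ltd", "Shopping Mall"),
    ("247 Learning Solutions Pvt Ltd", "Tea Room"),
    ("247 Learning Solutions Pvt Ltd", "Indian Restaurant"),
    ("247 Learning Solutions Pvt Ltd", "Ice Cream Shop"),
]
print(summarize_venues(sample_rows))

# Possible usage with the full dataframe:
# rows = zip(bangalore_venues['Neighborhood'], bangalore_venues['Venue Category'])
# summary = summarize_venues(rows)
```

The per-neighborhood category counts are also the raw material for the feature vectors that a *k*-means step would cluster.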
