# Web Scrapping Script
------------------------------------

## Datasets

Uber API - https://developer.uber.com/dashboard/

Uber Rides Python SDK (beta) - https://github.com/uber/rides-python-sdk 

Lyft API - https://developer.lyft.com/v1/reference 

Weather API - https://openweathermap.org/api

Yelp API - https://www.yelp.com/developers/documentation/v3/business_search 

As indicated in the Datasets section of this , we queried the Uber API. In order to do that, we must first register an application on Uber’s developer dashboard and install the uber-rides SDK, which is also available on Github.

After successfully creating an Uber session with a server token, we found out that the Uber Ride API accepts a pair of coordinates (latitudes and longitudes) as parameters to return data on its various types of services and their price estimates. 

In [1]:
from __future__ import print_function     #We can use print as a function.
import json
import pprint
import requests
import urllib
import os
from datetime import timedelta
import datetime                           #Allows us to use data time functions and set the date time as an index
import pandas as pd                       #to load the data file as a Pandas data frame and analyze the data
import numpy as np                        #It provides some advance math functionalities to python
import sqlite3                            #Allows us to connect to a database
from pandas.io import sql
from pandas.io.json import json_normalize

Getting the proper imports to run our script.

## Yelp API

In this study, we focus on the Uber ride in Boston and Cambridge areas. 

We have restricted our study to these two small geographic regions due to practical issues such as analysis on a smaller region will help us to predict better with the help of other factors such as weather or popularity of places. This would provide us to provide efficient analysis with available resources.


We are using the Yelp Api to get the location cordinates, which will be feeded to to both the Uber and Lyft Api for price prediction.

In [2]:
#Boston zipcodes: http://www.city-data.com/zipmaps/Boston-Massachusetts.html

#Cambridge zipcodes: http://www2.cambridgema.gov/CityOfCambridge_Content/documents/ZipCodeMap.pdf

zipcodes = ["02108", "02109", "02110", "02111", "02113", "02114", "02115", "02116", "02118", "02119", "02120", "02121", "02122", 
            "02124", "02125", "02126", "02127", "02128", "02129", "02130", "02131", "02132", "02134", "02135", "02136", "02151", 
            "02152", "02163", "02199", "02203", "02210", "02215", "02467", "02138", "02139", "02140", "02141", "02142"]
# These are the zipcodes covering all the neighbourhoods in Boston

Instead of inputting random coordinates in Boston and Cambridge, which are the areas we are parameterizing, we opted to query Yelp’s API as well because it returns actual businesses’ location and thankfully the returned data contains longitudes and latitudes.

In [3]:
# data = {'grant_type': 'client_credentials',
#         'client_id': app_id,
#         'client_secret': app_secret}
# token = requests.post('https://api.yelp.com/oauth2/token', data=data)
# access_token = token.json()['access_token']
api_key = 'nIBG6z_BWFoH2CUvBagSh-7LNxr0UIXp0TIgnKrxDvCkBiCYu2InnKrqFvm-_KYm2a8_EBoFGiAqv5OMsdg6h38Vn-BiQNLx6hhiTgKD1F7gBKNn5SQRAJgh3EGcWnYx'
url = 'https://api.yelp.com/v3/businesses/search'
headers = {'Authorization': 'bearer %s' % api_key}

# Yelp v3 API: https://nz.yelp.com/developers/documentation/v3
# https://www.yelp.com/developers/documentation/v3/business_search

for z in zipcodes:   
    params = {'location': '%s' % z, # for loop of zipcodes in Boston and Cambridge
              'categories': 'active', # active stands for all businesses https://www.yelp.com/developers/documentation/v3/all_category_list
              'limit': 50} # maximum 50
    
response = requests.get(url=url, params=params, headers=headers) #request.get allows us to make clean http requests to the api

results = response.json()['businesses']  # getting the json file for the above paramenters
for business in results:
     print(business['name'], business['location'])

print ('\nTotal Businesses retrieved:', len(results))


conn = sqlite3.connect('yelp.db')     #conn allows us to connect to the yelp database to query location information
cur = conn.cursor()


# normalizing json to pandas dataframe
df = json_normalize(results)

df = df.drop(['categories', 'location.display_address', 'transactions'], axis=1)

#Renaming the columns
df.columns = df.columns.str.replace(r'[.]', '_')

# converting to sqlite
df.to_sql("yelp_businesses", conn, if_exists="replace")

#Getting the data into our dataframe
pd.read_sql_query("select * from yelp_businesses;", conn)

Cambridge Center Roof Garden {'address1': '4 Cambridge Ctr', 'address2': '', 'address3': '', 'city': 'Cambridge', 'zip_code': '02142', 'country': 'US', 'state': 'MA', 'display_address': ['4 Cambridge Ctr', 'Cambridge, MA 02142']}
Boston Common {'address1': 'Beacon St', 'address2': None, 'address3': '', 'city': 'Boston', 'zip_code': '02139', 'country': 'US', 'state': 'MA', 'display_address': ['Beacon St', 'Boston, MA 02139']}
Trapology Boston {'address1': '177 Tremont St', 'address2': 'Fl 2', 'address3': '', 'city': 'Boston', 'zip_code': '02111', 'country': 'US', 'state': 'MA', 'display_address': ['177 Tremont St', 'Fl 2', 'Boston, MA 02111']}
Community Ice Skating at Kendall Square {'address1': '300 Athenaeum St', 'address2': None, 'address3': '', 'city': 'Cambridge', 'zip_code': '02142', 'country': 'US', 'state': 'MA', 'display_address': ['300 Athenaeum St', 'Cambridge, MA 02142']}
Btone Fitness - Back Bay {'address1': '30 Newbury St', 'address2': 'Fl 4', 'address3': '', 'city': 'Bost

Unnamed: 0,index,alias,coordinates_latitude,coordinates_longitude,display_phone,distance,id,image_url,is_closed,location_address1,...,location_city,location_country,location_state,location_zip_code,name,phone,price,rating,review_count,url
0,0,cambridge-center-roof-garden-cambridge,42.363166,-71.086729,,233.226556,x4vRczmgp446CdsDiJCUag,https://s3-media2.fl.yelpcdn.com/bphoto/4znlgU...,0,4 Cambridge Ctr,...,Cambridge,US,MA,2142,Cambridge Center Roof Garden,,,5.0,56,https://www.yelp.com/biz/cambridge-center-roof...
1,1,boston-common-boston,42.355373,-71.06575,(304) 563-4951,1801.659734,QFq8CccuRrEq9MsTUCnD9Q,https://s3-media1.fl.yelpcdn.com/bphoto/Jhx9n_...,0,Beacon St,...,Boston,US,MA,2139,Boston Common,13045634951.0,,4.5,397,https://www.yelp.com/biz/boston-common-boston?...
2,2,trapology-boston-boston,42.353014,-71.064216,(857) 285-2085,2055.066728,FaJGhYEOaYso9uKSYMAgbg,https://s3-media4.fl.yelpcdn.com/bphoto/s7_-qt...,0,177 Tremont St,...,Boston,US,MA,2111,Trapology Boston,18572852085.0,,5.0,135,https://www.yelp.com/biz/trapology-boston-bost...
3,3,community-ice-skating-at-kendall-square-cambridge,42.36458,-71.0815,(617) 492-0941,233.540483,wI6WwxkMwt4jFJrx16jNSA,https://s3-media2.fl.yelpcdn.com/bphoto/kd84-n...,0,300 Athenaeum St,...,Cambridge,US,MA,2142,Community Ice Skating at Kendall Square,16174920941.0,,4.5,35,https://www.yelp.com/biz/community-ice-skating...
4,4,btone-fitness-back-bay-boston,42.35204,-71.07259,(617) 578-8663,1647.81565,cZOzUfwPz8ANog-8nWzhRw,https://s3-media4.fl.yelpcdn.com/bphoto/I3gcI1...,0,30 Newbury St,...,Boston,US,MA,2116,Btone Fitness - Back Bay,16175788663.0,,4.5,176,https://www.yelp.com/biz/btone-fitness-back-ba...
5,5,charles-river-bike-paths-boston,42.360307,-71.072563,,1046.41381,WGqocG3zwHIsNb_dFNTF8A,https://s3-media2.fl.yelpcdn.com/bphoto/HrngKN...,0,Storrow And Memorial Dr,...,Boston,US,MA,2228,Charles River Bike Paths,,,4.5,77,https://www.yelp.com/biz/charles-river-bike-pa...
6,6,turnstyle-kendall-square-cambridge-2,42.36657,-71.09019,(617) 531-8922,601.138535,fF1MNu2TTDTNsLqx4u_80w,https://s3-media3.fl.yelpcdn.com/bphoto/J1k6jd...,0,One Kendall Square,...,Cambridge,US,MA,2139,Turnstyle - Kendall Square,16175318922.0,,4.0,69,https://www.yelp.com/biz/turnstyle-kendall-squ...
7,7,bodyscapes-fitness-cambridge,42.36368,-71.08341,(617) 252-0020,81.803994,ztZgacrzSQam20TVYZbKKQ,https://s3-media2.fl.yelpcdn.com/bphoto/Xt74Ot...,0,356 Third St,...,Cambridge,US,MA,2142,BodyScapes Fitness,16172520020.0,,4.5,26,https://www.yelp.com/biz/bodyscapes-fitness-ca...
8,8,the-charles-river-boston,42.361275,-71.074905,,827.063258,06jyL7iohG-Kvu7iiJblGg,https://s3-media4.fl.yelpcdn.com/bphoto/9-DD7M...,0,Storrow Dr,...,Boston,US,MA,2114,The Charles River,,,4.5,42,https://www.yelp.com/biz/the-charles-river-bos...
9,9,cambridge-athletic-club-cambridge,42.36422,-71.07854,(617) 491-8989,462.581625,37jNprFs5eS7sHwqllcSrA,https://s3-media3.fl.yelpcdn.com/bphoto/r7Ot1O...,0,215 First St,...,Cambridge,US,MA,2142,Cambridge Athletic Club,16174918989.0,,4.0,46,https://www.yelp.com/biz/cambridge-athletic-cl...


We are using api to scrap the data within the parameters i.e. zipcodes, and then we convert the json file to our dataframe.

## Uber API

Using the proper import methods to call the api.

In [4]:
# pip install uber-rides
from uber_rides.session import Session as uber_Session
from uber_rides.client import UberRidesClient
# conda install -c conda-forge geopy
from geopy.distance import vincenty
import csv

session = uber_Session(server_token='Uvu3eEPnLtPKCbTU7KrCko5jo1ua4CVgYAqd0JfO')
client = UberRidesClient(session)

Getting the source and destination coordinates for the API using our yelp database.

In [5]:
#Querying the database to get the required informarion from the database for the source
df1 = pd.read_sql_query("SELECT name, coordinates_latitude, coordinates_longitude, location_address1, location_address2, location_address3, location_city, location_state, location_zip_code, location_country FROM yelp_businesses ORDER BY RANDOM() LIMIT 1;", conn)
df1.head()

#Querying the database to get the required informarion from the database for the destination
df2 = pd.read_sql_query('SELECT name, coordinates_latitude, coordinates_longitude, location_address1, location_address2, location_address3, location_city, location_state, location_zip_code, location_country FROM yelp_businesses where name IN (SELECT name FROM yelp_businesses ORDER BY RANDOM() LIMIT 1)', conn)
df2.head()

#Since we only need the coordinates of the location we store it in an array, for ease of use
start_loc = (df1['coordinates_latitude'][0], df1['coordinates_longitude'][0])
start_loc

end_loc = (df2['coordinates_latitude'][0], df2['coordinates_longitude'][0])
end_loc

(42.3637, -71.08283)

Limiting the distance between our source and destinations to 1 mile, as no one would go for a ride within 1 mile.

In [6]:
distance = vincenty(start_loc, end_loc).miles
# We set the condition that the distance should be greater than 1 mile for entering the loop
if(distance > 1):           
    print(distance, '\n')
    # Calling the api and give the array as the source and destination to get the price estiamtes for a ride
    response = client.get_price_estimates(
    start_latitude= df1['coordinates_latitude'][0],
    start_longitude= df1['coordinates_longitude'][0],
    end_latitude=  df2['coordinates_latitude'][0],
    end_longitude= df2['coordinates_longitude'][0]
    )
    # The json file has price estimate for various type of rides offered by uber
    uber_rides = response.json.get("prices")
    print(uber_rides)
       

1.0299291303422835 

[{'localized_display_name': 'POOL', 'distance': 2.6, 'display_name': 'POOL', 'product_id': '997acbb5-e102-41e1-b155-9df7de0a73f2', 'high_estimate': 12.0, 'low_estimate': 8.0, 'duration': 540, 'estimate': '$8-11', 'currency_code': 'USD'}, {'localized_display_name': 'uberX', 'distance': 2.6, 'display_name': 'uberX', 'product_id': '55c66225-fbe7-4fd5-9072-eab1ece5e23e', 'high_estimate': 12.0, 'low_estimate': 9.0, 'duration': 540, 'estimate': '$9-12', 'currency_code': 'USD'}, {'localized_display_name': 'uberSUV', 'distance': 2.6, 'display_name': 'uberSUV', 'product_id': '6d318bcc-22a3-4af6-bddd-b409bfce1546', 'high_estimate': 36.0, 'low_estimate': 28.0, 'duration': 540, 'estimate': '$28-36', 'currency_code': 'USD'}, {'localized_display_name': 'uberXL', 'distance': 2.6, 'display_name': 'uberXL', 'product_id': '6f72dfc5-27f1-42e8-84db-ccc7a75f6969', 'high_estimate': 17.0, 'low_estimate': 13.0, 'duration': 540, 'estimate': '$13-17', 'currency_code': 'USD'}, {'localized_di

Setting time and cooridnates to our json file.

In [7]:
dt = datetime.datetime.now()    
# The following for loop will allow us to iterate through all the rows to add time, day, date, start location, end location and the coordinates
for rides in uber_rides:
    rides["time"] = dt.strftime('%H:%M:%S')
    rides['day'] = dt.strftime('%A')
    rides['date'] = dt.strftime('%B %d, %Y')
    rides["start_latitude"] = df1['coordinates_latitude'][0]
    rides["start_longitude"] = df1['coordinates_longitude'][0]
    rides["end_latitude"] = df2['coordinates_latitude'][0]
    rides["end_longitude"] = df2['coordinates_longitude'][0]
    rides['start_location'] = df1['name'][0]
    rides['end_location'] = df2['name'][0]

Converting our json file to the data frame and getting the output as a csv file. This will allow us to gather our data for further analysis.

In [8]:
df_uber = pd.DataFrame(uber_rides)

# to append when sending to server
with open('uber_test.csv', 'a') as f:
    df_uber.to_csv(f, sep=',', encoding='utf-8', index=False, header=False)

## Lyft API

Some imports that are needed to run the Lyft Api

In [9]:
from lyft_rides.auth import ClientCredentialGrant
from lyft_rides.session import Session as lyft_Session
from lyft_rides.auth import AuthorizationCodeGrant

Using the same source and destiantion coordinates to get price estimations using the Lyft api for getting comparisons between Uber and Lyft.

In [10]:
auth_flow = ClientCredentialGrant(
    'gRUenY4LPYg_',
    'dFyiT-f23Jwmo_A7n2xGfzp_WcWvBIi8',
    'public',
    )
lyft_session = auth_flow.get_session()

#Use the same location
df1.head()

df2.head()

#Get the ride type, Introduction to different type of lyft:https://developer.lyft.com/docs/glossary
from lyft_rides.client import LyftRidesClient

lyft_client=LyftRidesClient(lyft_session)
lyft_type_response = lyft_client.get_ride_types(df1['coordinates_latitude'][0], df1['coordinates_longitude'][0])
ride_types = lyft_type_response.json.get('ride_types')
print(ride_types)

[{'display_name': 'Lyft Line', 'category_key': 'courier', 'image_url': 'https://cdn.lyft.com/assets/car_standard.png', 'pricing_details': {'base_charge': 210, 'cost_per_mile': 135, 'cost_per_minute': 21, 'cost_minimum': 350, 'trust_and_service': 185, 'currency': 'USD', 'cancel_penalty_amount': 500}, 'seats': 2, 'ride_type': 'lyft_line'}, {'display_name': 'Lyft', 'category_key': 'standard', 'image_url': 'https://cdn.lyft.com/assets/car_standard.png', 'pricing_details': {'base_charge': 210, 'cost_per_mile': 135, 'cost_per_minute': 21, 'cost_minimum': 500, 'trust_and_service': 185, 'currency': 'USD', 'cancel_penalty_amount': 500}, 'seats': 4, 'ride_type': 'lyft'}, {'display_name': 'Lyft Plus', 'category_key': 'plus', 'image_url': 'https://cdn.lyft.com/assets/car_plus.png', 'pricing_details': {'base_charge': 350, 'cost_per_mile': 250, 'cost_per_minute': 35, 'cost_minimum': 600, 'trust_and_service': 210, 'currency': 'USD', 'cancel_penalty_amount': 500}, 'seats': 6, 'ride_type': 'lyft_plus'}

Verification of minimum distance

In [11]:
#Get the ride estimated cost
distance = vincenty(start_loc, end_loc).miles
# We set the condition that the distance should be greater than 1 mile for entering the loop
if(distance > 1):
    print(distance)
    # Calling the api and give the array as the source and destination to get the price estiamtes for a ride
    lyft_price_response = lyft_client.get_cost_estimates(
    start_latitude= df1['coordinates_latitude'][0],
    start_longitude= df1['coordinates_longitude'][0],
    end_latitude=  df2['coordinates_latitude'][0],
    end_longitude= df2['coordinates_longitude'][0]
    )
    # The json file has price estimate for various type of rides offered by Lyft
    lyft_rides = lyft_price_response.json.get('cost_estimates')
    print(lyft_rides)

1.0299291303422835
[{'ride_type': 'lyft_line', 'estimated_duration_seconds': 579, 'estimated_distance_miles': 3.1, 'price_quote_id': 'ec59ea917b67f38a341af35fe7f8c1e0d61316d877db34975dd83911e855b76d', 'estimated_cost_cents_max': 367, 'primetime_percentage': '0%', 'is_valid_estimate': True, 'currency': 'USD', 'cost_token': None, 'estimated_cost_cents_min': 367, 'display_name': 'Lyft Line', 'primetime_confirmation_token': None, 'can_request_ride': True}, {'ride_type': 'lyft', 'estimated_duration_seconds': 579, 'estimated_distance_miles': 3.1, 'price_quote_id': 'ec59ea917b67f38a341af35fe7f8c1e0d61316d877db34975dd83911e855b76d', 'estimated_cost_cents_max': 1020, 'primetime_percentage': '0%', 'is_valid_estimate': True, 'currency': 'USD', 'cost_token': None, 'estimated_cost_cents_min': 1020, 'display_name': 'Lyft', 'primetime_confirmation_token': None, 'can_request_ride': True}, {'ride_type': 'lyft_plus', 'estimated_duration_seconds': 579, 'estimated_distance_miles': 3.1, 'price_quote_id': '

Adding date-time and inserting start and end locations to the json file and then converting them to a dataframe and apending them to a csv file.

In [None]:
# The following for loop will allow us to iterate through all the rows to add time, day, date, start location, end location and the coordinates
for rides in lyft_rides:
    rides["time"] = dt.strftime('%H:%M:%S')
    rides['day'] = dt.strftime('%A')
    rides['date'] = dt.strftime('%B %d, %Y')
    rides["start_latitude"] = df1['coordinates_latitude'][0]
    rides["start_longitude"] = df1['coordinates_longitude'][0]
    rides["end_latitude"] = df2['coordinates_latitude'][0]
    rides["end_longitude"] = df2['coordinates_longitude'][0]
    rides['start_location'] = df1['name'][0]
    rides['end_location'] = df2['name'][0]
# Create a data frame for the lyft
df_lyft = pd.DataFrame(lyft_rides)
df_lyft

file_name_lyft = os.path.join(os.getcwd(), 'lyft_test.csv')
df_lyft.to_csv(file_name_lyft, sep=',', encoding='utf-8', index=False
               
# to append when sending to server
with open('lyft_test.csv', 'a') as f:
    df_lyft.to_csv(f, sep=',', encoding='utf-8', index=False, header=False)

## Weather API

Getting the weather data as a feature to add to our dataset for further analysis.

In [13]:
#Get the current weather information from lat and long
#http://api.openweathermap.org/data/2.5/weather?lat=42.37046&lon=-71.10352&appid=119fe664452f079528a64467c793dd7d
lat=str(df1['coordinates_latitude'][0])
long=str(df1['coordinates_longitude'][0])

api_address='http://api.openweathermap.org/data/2.5/weather?lat='+lat+'&lon='+long+'&appid=119fe664452f079528a64467c793dd7d&q='

# real_time=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

weather_data = requests.get(api_address).json()

print(weather_data)
# print(real_time)

{'coord': {'lon': -71.07, 'lat': 42.35}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01n'}], 'base': 'stations', 'main': {'temp': 278.23, 'pressure': 1030, 'humidity': 60, 'temp_min': 276.15, 'temp_max': 281.15}, 'visibility': 16093, 'wind': {'speed': 3.1, 'deg': 230}, 'clouds': {'all': 1}, 'dt': 1524557700, 'sys': {'type': 1, 'id': 1274, 'message': 0.0048, 'country': 'US', 'sunrise': 1524563337, 'sunset': 1524612987}, 'id': 4930956, 'name': 'Boston', 'cod': 200}


In [14]:
weather_data['weather'] = weather_data['weather'][0]['main']
df_weather = json_normalize(weather_data)

In [15]:
# Changing the name of the columns for ease of use
df_weather.columns = df_weather.columns.str.replace(r'[.]', '_')
df_weather.head()
# Dropping columns that are not usefull
df_weather = df_weather[['weather', 'main_temp', 'main_temp_max', 'main_temp_min']]
df_weather.head()

Unnamed: 0,weather,main_temp,main_temp_max,main_temp_min
0,Clear,278.23,281.15,276.15


## License

The content of this project itself is licensed under the Creative Commons Attribution 3.0 license, and the underlying source code used to format and display that content is licensed under the [MIT License](https://github.com/rahilshah10/IS/blob/master/LICENSE).