# Abstract:

#### In this notebook, we extract information about San Francisco boutiques from Yelp for later analysis in our NLP project:

**Step 1:**  Define a function that iterates over all the search result pages and extracts every boutique name and its information.

**Step 2:**  Define a function that returns the list of reviews, with date and rating, for a boutique across all of its review pages.

**Step 3:**  Define a function that combines the previous steps: iterate over all the result pages and extract every boutique name, its info, and its reviews.

# Objectives:

## At the end of this notebook we will have two separate csv files:

### 1. sf_wclothing_boutiques_info.csv

A file with all San Francisco women's clothing boutique names and their information, including contact numbers, addresses, page URLs, price ranges, and average ratings.

### 2. sf_wclothing_boutiques_review.csv

A file with all San Francisco women's clothing boutique names, their number of reviews, the review texts, and the date and rating of each review.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline
from glob import glob
import pickle

#import bs4 as bs or:
from bs4 import BeautifulSoup
#BeautifulSoup is a Python library for pulling data out of HTML and XML files

import re
import requests
import urllib.request as url
# urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a simple
# interface in the form of the urlopen function.

from time import sleep
# Use a delay between page scrapes to reduce the risk of being blocked by Yelp
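The delay mentioned above can be wrapped in a small helper so that every request is followed by a pause. The sketch below is only an illustration: `polite_get`, `min_delay`, `max_delay`, and `sleep_fn` are hypothetical names that do not appear elsewhere in this notebook, and the 1 to 4 second range is an assumption.

```python
import random
import time

import requests


def polite_get(url, session=None, min_delay=1.0, max_delay=4.0, sleep_fn=time.sleep):
    """Fetch `url`, then pause for a random interval before returning.

    All parameter names here are hypothetical and chosen for illustration.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=10)
    # random.uniform draws a float between min_delay and max_delay
    delay = random.uniform(min_delay, max_delay)
    sleep_fn(delay)
    return response
```

Passing `sleep_fn` explicitly makes the helper easy to exercise without actually waiting, for example by handing it a list's `append` method.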

# Step 1:

#### 1-1. Set main and secondary attributes and class names to each of required features:

In [2]:
# for all the boutiques in yelp search result:
main_attributes_class="lemon--div__373c0__1mboc mainAttributes__373c0__1r0QA arrange-unit__373c0__o3tjT arrange-unit-fill__373c0__3Sfw1 border-color--default__373c0__3-ifU"
# including:
business_name_class="lemon--div__373c0__1mboc businessName__373c0__1fTgn display--inline-block__373c0__1ZKqC border-color--default__373c0__3-ifU"
rating_class="lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"
price_range_class="lemon--span__373c0__3997G text__373c0__2Kxyz priceRange__373c0__2DY87 text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa- text-bullet--after__373c0__3fS1Z"
review_count_class="lemon--span__373c0__3997G text__373c0__2Kxyz reviewCount__373c0__2r4xT text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa-"

secondary_attributes_class="lemon--div__373c0__1mboc secondaryAttributes__373c0__7bA0w arrange-unit__373c0__o3tjT border-color--default__373c0__3-ifU"
# including:
business_phonenumber_class="lemon--div__373c0__1mboc display--inline-block__373c0__1ZKqC border-color--default__373c0__3-ifU"
business_address_class="lemon--span__373c0__3997G raw__373c0__3rcx7"
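The full class strings above include generated suffixes such as `__373c0__1fTgn`, which are likely to change whenever Yelp rebuilds its front end. One possible alternative, shown here as a sketch on a made-up HTML fragment rather than on a real Yelp page, is to match only the stable-looking part of the class name with a CSS attribute selector. The assumption that fragments like `businessName` and `priceRange` stay stable is untested.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the class naming pattern used above
sample_html = """
<div class="lemon--div__373c0__1mboc mainAttributes__373c0__1r0QA">
  <div class="lemon--div__373c0__1mboc businessName__373c0__1fTgn">
    <a href="/biz/example-boutique">Example Boutique</a>
  </div>
  <span class="lemon--span__373c0__3997G priceRange__373c0__2DY87">$$</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# [class*="..."] matches elements whose class attribute contains the substring
name_link = soup.select_one('div[class*="businessName"] a')
price = soup.select_one('span[class*="priceRange"]')

print(name_link.get_text(strip=True))
print(name_link["href"])
print(price.get_text(strip=True))
```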

#### 1-2. Define a function to iterate over all the result pages and extract all the boutique names and their information:

In [3]:
#find_all returns a ResultSet object which you can iterate over using a for loop
def retrieve_boutique_info(num_stores):
    boutique_name=[]
    rating=[]
    number_review=[]
    price_range=[]
    url_address=[]
    phone_number=[]
    address=[]

    #total num_stores in all pages = 240
    for i in range(0, num_stores, 10):
        
        # 1.to track the process, print("search_pages:",i)
        #print(i)
        
        # 2. Build the URL of each search page; the start parameter is the offset of the first result on that page:
        url="https://www.yelp.com/search?find_desc=Clothing%20Boutiques&find_loc=San%20Francisco%2C%20CA&ns=1&cflt=womenscloth&start={}".format(i)
        
        # 3.Make a get request to retrieve each page
        html_page = requests.get(url)
        
        # 4.Pass the page contents to beautiful soup for parsing
        soup = BeautifulSoup(html_page.content, 'html.parser')
        
        # 5. Find all divs with the main and secondary attribute classes (one pair per store listed on the search result page) and store them in mains and secondaries
        mains=soup.find_all("div",{"class":main_attributes_class})
        secondaries=soup.find_all("div",{"class":secondary_attributes_class})
        
        for main in mains:
            try:
                boutique_name.append(main.find("a").text)
            except:
                boutique_name.append(None)
            try:
                rating.append(main.find("span",{"class":rating_class}).div.get("aria-label"))
            except:
                rating.append(None)
            try:
                number_review.append(main.find("span",{"class":review_count_class}).text)
            except:
                number_review.append(None)
            try:
                price_range.append(main.find("span",{"class":price_range_class}).get_text())
            except:
                price_range.append(None)
            try:
                base_url="https://www.yelp.com"
                business_name_url=main.find('a').attrs['href']
                url_address.append(base_url+business_name_url)
            except:
                url_address.append(None)
        
        
        for secondary in secondaries:
            try:
                phone_number.append(secondary.find("p").text)
            except:
                phone_number.append(None)
            try:
                address.append(secondary.find("span",{"class":business_address_class}).get_text())
            except:
                address.append(None) 
        
        
        # Random delay of 1 or 2 seconds to reduce the risk of getting blocked
        # (np.random.randint excludes the upper bound, so randint(1, 3) never returns 3)
        sleep(np.random.randint(1,3))
    
    data={"boutique_name":boutique_name,"rating":rating,"number_reviews":number_review,
          "price_range":price_range,"phone_number":phone_number,"address":address,"url_address":url_address}
    df=pd.DataFrame(data)
    
    return df
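The function above fills the name and phone columns from two separate loops, so a listing without a secondary block would shift the later phone numbers and addresses onto the wrong rows, or make the column lengths unequal. A possible way around this, sketched below on a made-up HTML fragment, is to parse one listing at a time and build one dictionary per row. The assumption that each listing has a common parent element (an `<li>` here) is hypothetical and would need to be checked against the real page; `text_or_none` is also a hypothetical helper.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up fragment: the second listing has no secondary attributes block
sample_html = """
<ul>
  <li>
    <div class="mainAttributes"><a href="/biz/shop-a">Shop A</a></div>
    <div class="secondaryAttributes"><p>(415) 555-0100</p></div>
  </li>
  <li>
    <div class="mainAttributes"><a href="/biz/shop-b">Shop B</a></div>
  </li>
</ul>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for listing in soup.find_all("li"):  # assumed per-listing container
    link = listing.select_one('div[class*="mainAttributes"] a')
    phone = listing.select_one('div[class*="secondaryAttributes"] p')
    rows.append({
        "boutique_name": text_or_none(link),
        "url_address": "https://www.yelp.com" + link["href"] if link is not None else None,
        "phone_number": text_or_none(phone),
    })

listings_df = pd.DataFrame(rows)
print(listings_df)
```

Because every row is assembled from a single listing, a missing field becomes an empty cell in that row instead of displacing values from other listings.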

#### 1-3. Save the results into a csv file "sf_wclothing_boutiques_info":

In [5]:
#create the boutiques_info dataframe including all the info for all the boutiques in yelp search result:
boutiques_info_df = retrieve_boutique_info(num_stores=240)

#check the result for duplicate boutique names and their info:
pd.set_option('display.max_rows', 100)
pd.options.display.max_colwidth = 100
boutiques_info_df.head(50)

| | boutique_name | rating | number_reviews | price_range | phone_number | address | url_address |
|---|---|---|---|---|---|---|---|
| 0 | Intentionally Blank | 5 star rating | 1 | | (415) 964-8436 | 1360 Valencia St | https://www.yelp.com/adredir?ad_business_id=43vmshloeCqt3Sc9ySXP8g&campaign_id=Fqp0bHjvWT9HZL2rV... |
| 1 | Kisha Studio Fashion Boutique | 5 star rating | 121 | $$ | (415) 422-0468 | 210 Clement St | https://www.yelp.com/biz/kisha-studio-fashion-boutique-san-francisco-2?osq=Clothing+Boutiques |
| 2 | Morning Lavender | 4 star rating | 28 | $$ | (650) 797-0686 | 1846 Union St | https://www.yelp.com/biz/morning-lavender-san-francisco-2?osq=Clothing+Boutiques |
| 3 | Onyx | 4.5 star rating | 17 | $$ | (415) 431-6699 | 289 Divisadero St | https://www.yelp.com/biz/onyx-san-francisco-3?osq=Clothing+Boutiques |
| 4 | Wild Feather | 5 star rating | 29 | $$$ | (415) 786-2614 | 597 Haight St | https://www.yelp.com/biz/wild-feather-san-francisco?osq=Clothing+Boutiques |
| 5 | Two Birds | 4 star rating | 24 | $$$ | (415) 285-1840 | 1309 Castro St | https://www.yelp.com/biz/two-birds-san-francisco-2?osq=Clothing+Boutiques |
| 6 | Gravel & Gold | 4.5 star rating | 32 | $$$ | (415) 552-0112 | 3266 21st St | https://www.yelp.com/biz/gravel-and-gold-san-francisco?osq=Clothing+Boutiques |
| 7 | Current Clothing | 5 star rating | 23 | $$ | (415) 400-5517 | 1738 Union St | https://www.yelp.com/biz/current-clothing-san-francisco?osq=Clothing+Boutiques |
| 8 | ANOMIE | 4.5 star rating | 15 | $$$ | (415) 872-9943 | 2149 Union St | https://www.yelp.com/biz/anomie-san-francisco?osq=Clothing+Boutiques |
| 9 | Siren Boutique | 5 star rating | 12 | $$ | (415) 702-6747 | 244 W Portal Ave | https://www.yelp.com/biz/siren-boutique-san-francisco-4?osq=Clothing+Boutiques |
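The `rating` and `number_reviews` columns come back as text (for example `5 star rating`), which is inconvenient for later analysis. The sketch below shows one way they might be converted to numbers; the column names `rating_value` and `review_count` are hypothetical, and the sample rows only reuse two rating and review-count values from the output above plus one missing row.

```python
import pandas as pd

sample = pd.DataFrame({
    "rating": ["5 star rating", "4.5 star rating", None],
    "number_reviews": ["121", "17", None],
})

# Pull the leading number (integer or decimal) out of strings like "4.5 star rating"
sample["rating_value"] = (
    sample["rating"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Convert review counts to numbers; missing or unparsable values become NaN
sample["review_count"] = pd.to_numeric(sample["number_reviews"], errors="coerce")

print(sample[["rating_value", "review_count"]])
```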


In [6]:
# duplicatedrows = boutiques_info_df[boutiques_info_df.duplicated()]
# print(duplicatedrows)

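# drop all rows with duplicate boutique_names, which are mostly ads
# (keep=False removes every copy of a duplicated name, not only the repeats):
boutiques_info_df_dropped = boutiques_info_df.drop_duplicates(subset="boutique_name", keep=False)

# reset the index after dropping duplicates
# (use the de-duplicated frame here; resetting boutiques_info_df would discard the drop above)
boutiques_info_df_dropped_reset = boutiques_info_df_dropped.reset_index(drop=True)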

boutiques_info_df_dropped_reset.head(50)

| | boutique_name | rating | number_reviews | price_range | phone_number | address | url_address |
|---|---|---|---|---|---|---|---|
| 0 | Intentionally Blank | 5 star rating | 1 | | (415) 964-8436 | 1360 Valencia St | https://www.yelp.com/adredir?ad_business_id=43vmshloeCqt3Sc9ySXP8g&campaign_id=Fqp0bHjvWT9HZL2rV... |
| 1 | Kisha Studio Fashion Boutique | 5 star rating | 121 | $$ | (415) 422-0468 | 210 Clement St | https://www.yelp.com/biz/kisha-studio-fashion-boutique-san-francisco-2?osq=Clothing+Boutiques |
| 2 | Morning Lavender | 4 star rating | 28 | $$ | (650) 797-0686 | 1846 Union St | https://www.yelp.com/biz/morning-lavender-san-francisco-2?osq=Clothing+Boutiques |
| 3 | Onyx | 4.5 star rating | 17 | $$ | (415) 431-6699 | 289 Divisadero St | https://www.yelp.com/biz/onyx-san-francisco-3?osq=Clothing+Boutiques |
| 4 | Wild Feather | 5 star rating | 29 | $$$ | (415) 786-2614 | 597 Haight St | https://www.yelp.com/biz/wild-feather-san-francisco?osq=Clothing+Boutiques |
| 5 | Two Birds | 4 star rating | 24 | $$$ | (415) 285-1840 | 1309 Castro St | https://www.yelp.com/biz/two-birds-san-francisco-2?osq=Clothing+Boutiques |
| 6 | Gravel & Gold | 4.5 star rating | 32 | $$$ | (415) 552-0112 | 3266 21st St | https://www.yelp.com/biz/gravel-and-gold-san-francisco?osq=Clothing+Boutiques |
| 7 | Current Clothing | 5 star rating | 23 | $$ | (415) 400-5517 | 1738 Union St | https://www.yelp.com/biz/current-clothing-san-francisco?osq=Clothing+Boutiques |
| 8 | ANOMIE | 4.5 star rating | 15 | $$$ | (415) 872-9943 | 2149 Union St | https://www.yelp.com/biz/anomie-san-francisco?osq=Clothing+Boutiques |
| 9 | Siren Boutique | 5 star rating | 12 | $$ | (415) 702-6747 | 244 W Portal Ave | https://www.yelp.com/biz/siren-boutique-san-francisco-4?osq=Clothing+Boutiques |
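Row 0 in the output is a sponsored listing, recognizable by the `/adredir` path in its URL. Since `keep=False` also discards genuine boutiques that happen to appear twice, an alternative worth considering is to filter ads by URL first and then keep one copy of each remaining name. The sketch below uses made-up rows, and the assumption that every ad URL contains `/adredir` is based only on the single example in the output above.

```python
import pandas as pd

# Made-up rows: one ad, one boutique listed twice, one boutique listed once
sample = pd.DataFrame({
    "boutique_name": ["Sponsored Shop", "Shop A", "Shop B", "Shop A"],
    "url_address": [
        "https://www.yelp.com/adredir?ad_business_id=EXAMPLE",
        "https://www.yelp.com/biz/shop-a",
        "https://www.yelp.com/biz/shop-b",
        "https://www.yelp.com/biz/shop-a",
    ],
})

# Flag sponsored listings by their redirect URL (assumed pattern)
is_ad = sample["url_address"].str.contains("/adredir", regex=False, na=False)
organic = sample[~is_ad]

# keep="first" retains one copy of each name; keep=False would drop both copies
deduped = organic.drop_duplicates(subset="boutique_name", keep="first").reset_index(drop=True)
strict = organic.drop_duplicates(subset="boutique_name", keep=False).reset_index(drop=True)

print(deduped["boutique_name"].tolist())
print(strict["boutique_name"].tolist())
```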


In [7]:
#save the dataframe as a csv file:
boutiques_info_df_dropped_reset.to_csv("sf_wclothing_boutiques_info.csv")

In [8]:
#save the dataframe as a pickle file for further NLP analysis:
#Write data to file
with open('boutique_info.pickle','wb') as f:
    pickle.dump(boutiques_info_df_dropped_reset,f,pickle.HIGHEST_PROTOCOL)
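To confirm that both files can be read back later, a round trip like the following could be run. This is a sketch on a one-row sample written to a temporary directory, with hypothetical file names, so it does not touch the real output files. It also shows why a CSV saved with the default settings gains an `Unnamed: 0` column when reloaded, and how `index_col=0` avoids that.

```python
import os
import pickle
import tempfile

import pandas as pd

sample = pd.DataFrame({"boutique_name": ["Shop A"], "rating": ["5 star rating"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "sample_info.pickle")
    csv_path = os.path.join(tmp_dir, "sample_info.csv")

    # Write both formats the same way the notebook does
    with open(pickle_path, "wb") as f:
        pickle.dump(sample, f, pickle.HIGHEST_PROTOCOL)
    sample.to_csv(csv_path)  # the index is written as an unnamed first column

    # Read them back
    with open(pickle_path, "rb") as f:
        from_pickle = pickle.load(f)
    from_csv_default = pd.read_csv(csv_path)           # index shows up as "Unnamed: 0"
    from_csv = pd.read_csv(csv_path, index_col=0)      # index restored instead

print(from_csv_default.columns.tolist())
print(from_csv.columns.tolist())
```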