# Abstract:

#### In this notebook, we extract information about San Francisco boutiques from Yelp for further analysis in our NLP project. The work is split into three steps:

**Step 1:**  Define a function that iterates over all the search result pages and extracts every boutique name along with its information.

**Step 2:**  Define a function that returns the list of reviews, with date and rating, for one boutique across all of its review pages.

**Step 3:**  Define a function that combines the previous steps: iterate over all the result pages and extract every boutique name, its info, and its reviews.

# Objectives:

## At the end of this notebook we will have two separate CSV files:

### 1. sf_wclothing_boutiques_info.csv

A file with the names and information of all San Francisco women's clothing boutiques, including phone numbers, addresses, page URLs, price ranges, and average ratings.

### 2. sf_wclothing_boutiques_review.csv

A file with the names of all San Francisco women's clothing boutiques, their number of reviews, the review texts, and the date and rating of each review.



# Import Libraries

In [33]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
%matplotlib inline
from glob import glob
import pickle

#import bs4 as bs or:
from bs4 import BeautifulSoup
#BeautifulSoup is a Python library for pulling data out of HTML and XML files

import re
import requests
import urllib.request as url
#urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a simple
#interface, in the form of the urlopen function.

from time import sleep
#Need to use a delay between page scrapes in order to limit getting blocked by Yelp

# step 1:

#### 1-1. Set the main and secondary attribute class names for each of the required features:

In [2]:
# for all the boutiques in yelp search result:
main_attributes_class="lemon--div__373c0__1mboc mainAttributes__373c0__1r0QA arrange-unit__373c0__o3tjT arrange-unit-fill__373c0__3Sfw1 border-color--default__373c0__3-ifU"
# including:
business_name_class="lemon--div__373c0__1mboc businessName__373c0__1fTgn display--inline-block__373c0__1ZKqC border-color--default__373c0__3-ifU"
rating_class="lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"
price_range_class="lemon--span__373c0__3997G text__373c0__2Kxyz priceRange__373c0__2DY87 text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa- text-bullet--after__373c0__3fS1Z"
review_count_class="lemon--span__373c0__3997G text__373c0__2Kxyz reviewCount__373c0__2r4xT text-color--black-extra-light__373c0__2OyzO text-align--left__373c0__2XGa-"

secondary_attributes_class="lemon--div__373c0__1mboc secondaryAttributes__373c0__7bA0w arrange-unit__373c0__o3tjT border-color--default__373c0__3-ifU"
# including:
business_phonenumber_class="lemon--div__373c0__1mboc display--inline-block__373c0__1ZKqC border-color--default__373c0__3-ifU"
business_address_class="lemon--span__373c0__3997G raw__373c0__3rcx7"
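
The class strings above look auto-generated and may change when the site is rebuilt, which makes an exact-match lookup brittle. The sketch below runs on a made-up HTML snippet (the markup and the shortened class names are assumptions, not copied from a real Yelp page) and contrasts an exact class-string match with a CSS substring selector on the readable part of the name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not a real Yelp page.
html = """
<div class="lemon--div__abc mainAttributes__abc__1r0QA arrange-unit__abc">
  <a href="/biz/example-boutique">Example Boutique</a>
</div>
<div class="lemon--div__abc secondaryAttributes__abc__7bA0w">
  <p>(415) 555-0100</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: the whole class attribute string must be identical.
exact = soup.find_all(
    "div",
    {"class": "lemon--div__abc mainAttributes__abc__1r0QA arrange-unit__abc"},
)

# Substring match: any div whose class attribute contains "mainAttributes".
partial = soup.select('div[class*="mainAttributes"]')

names = [div.find("a").text for div in partial]
print(len(exact), names)
```

The substring selector keeps working if only the hashed suffixes change, at the cost of possibly matching more elements than intended.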

#### 1-2. Define a function to iterate over all the result pages and extract all the boutique names and their information:

In [3]:
#find_all returns a ResultSet object which you can iterate over using a for loop
def retrieve_boutique_info(num_stores):
    boutique_name=[]
    rating=[]
    number_review=[]
    price_range=[]
    url_address=[]
    phone_number=[]
    address=[]

    #total num_stores in all pages = 240
    for i in range(0, num_stores, 10):
        
        # 1.to track the process, print("search_pages:",i)
        #print(i)
        
        # 2.build the url of each search page from a common pattern; only the start offset changes per page:
        url="https://www.yelp.com/search?find_desc=Clothing%20Boutiques&find_loc=San%20Francisco%2C%20CA&ns=1&cflt=womenscloth&start={}".format(i)
        
        # 3.Make a get request to retrieve each page
        html_page = requests.get(url)
        
        # 4.Pass the page contents to beautiful soup for parsing
        soup = BeautifulSoup(html_page.content, 'html.parser')
        
        # 5.find all the divs with the main and secondary attribute classes (one of each per store listed on the search result page)
        mains=soup.find_all("div",{"class":main_attributes_class})
        secondaries=soup.find_all("div",{"class":secondary_attributes_class})
        
        for main in mains:
            try:
                boutique_name.append(main.find("a").text)
            except:
                boutique_name.append(None)
            try:
                rating.append(main.find("span",{"class":rating_class}).div.get("aria-label"))
            except:
                rating.append(None)
            try:
                number_review.append(main.find("span",{"class":review_count_class}).text)
            except:
                number_review.append(None)
            try:
                price_range.append(main.find("span",{"class":price_range_class}).get_text())
            except:
                price_range.append(None)
            try:
                base_url="https://www.yelp.com"
                business_name_url=main.find('a').attrs['href']
                url_address.append(base_url+business_name_url)
            except:
                url_address.append(None)
        
        
        for secondary in secondaries:
            try:
                phone_number.append(secondary.find("p").text)
            except:
                phone_number.append(None)
            try:
                address.append(secondary.find("span",{"class":business_address_class}).get_text())
            except:
                address.append(None) 
        
        
        #Random delay of 1 or 2 seconds (np.random.randint excludes the upper bound) to limit getting blocked
        sleep(np.random.randint(1,3))
    
    data={"boutique_name":boutique_name,"rating":rating,"number_reviews":number_review,
          "price_range":price_range,"phone_number":phone_number,"address":address,"url_address":url_address}
    df=pd.DataFrame(data)
    
    return df
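
The loop above advances the `start` query parameter in steps of 10, one search page per step. A minimal sketch of just that pagination logic follows; `search_page_urls` and `page_size` are illustrative names, and the 10-results-per-page figure is taken from the `range` step in the function above.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.yelp.com/search"

def search_page_urls(num_stores, page_size=10):
    """Return one search URL per result page, offset by `page_size`."""
    urls = []
    for start in range(0, num_stores, page_size):
        # urlencode writes spaces as "+" instead of "%20";
        # both forms are valid in a query string.
        query = urlencode({
            "find_desc": "Clothing Boutiques",
            "find_loc": "San Francisco, CA",
            "ns": 1,
            "cflt": "womenscloth",
            "start": start,
        })
        urls.append(f"{BASE_SEARCH_URL}?{query}")
    return urls

urls = search_page_urls(240)
print(len(urls))   # number of pages that would be requested
print(urls[-1])    # URL of the last page
```

Building the list first makes it easy to inspect the URLs before sending any requests.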

#### 1-3. Save the results into a csv file "sf_wclothing_boutiques_info":

In [14]:
#create the boutiques_info dataframe including all the info for all the boutiques in yelp search result:
boutiques_info_df = retrieve_boutique_info(num_stores=240)

#check the result for duplicate boutique names and their info:
pd.set_option('display.max_rows', 100)
pd.options.display.max_colwidth = 100
boutiques_info_df.head(50)

Unnamed: 0,boutique_name,rating,number_reviews,price_range,phone_number,address,url_address
0,Marzel’s,4 star rating,12,$$$,(925) 433-9163,1220 Oakland Blvd,https://www.yelp.com/adredir?ad_business_id=bDdNGMfzsk1xDA-7MQEyWg&campaign_id=ey-zPs_Gwfv8KxEff...
1,Kisha Studio Fashion Boutique,5 star rating,120,$$,(415) 422-0468,210 Clement St,https://www.yelp.com/biz/kisha-studio-fashion-boutique-san-francisco-2?osq=Clothing+Boutiques
2,Morning Lavender,4 star rating,28,$$,(650) 797-0686,1846 Union St,https://www.yelp.com/biz/morning-lavender-san-francisco-2?osq=Clothing+Boutiques
3,Onyx,4.5 star rating,17,$$,(415) 431-6699,289 Divisadero St,https://www.yelp.com/biz/onyx-san-francisco-3?osq=Clothing+Boutiques
4,Wild Feather,5 star rating,29,$$$,(415) 786-2614,597 Haight St,https://www.yelp.com/biz/wild-feather-san-francisco?osq=Clothing+Boutiques
5,Asmbly Hall - formerly on Fillmore Street,4.5 star rating,26,$$,(415) 801-5862,624 Divisadero St,https://www.yelp.com/biz/asmbly-hall-formerly-on-fillmore-street-san-francisco?osq=Clothing+Bout...
6,Two Birds,4 star rating,24,$$$,(415) 285-1840,1309 Castro St,https://www.yelp.com/biz/two-birds-san-francisco-2?osq=Clothing+Boutiques
7,Gravel & Gold,4.5 star rating,32,$$$,(415) 552-0112,3266 21st St,https://www.yelp.com/biz/gravel-and-gold-san-francisco?osq=Clothing+Boutiques
8,Current Clothing,5 star rating,23,$$,(415) 400-5517,1738 Union St,https://www.yelp.com/biz/current-clothing-san-francisco?osq=Clothing+Boutiques
9,ANOMIE,4.5 star rating,15,$$$,(415) 872-9943,2149 Union St,https://www.yelp.com/biz/anomie-san-francisco?osq=Clothing+Boutiques
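
The `rating` column holds the `aria-label` text (for example `4.5 star rating`) rather than a number. For numeric analysis the value can be pulled out with a regular expression; `parse_star_rating` below is a hypothetical helper, not part of the original notebook.

```python
import re

def parse_star_rating(label):
    """Extract the leading number from labels like '4.5 star rating'."""
    if label is None:
        return None
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+star", label)
    return float(match.group(1)) if match else None

print(parse_star_rating("4.5 star rating"))
print(parse_star_rating("5 star rating"))
print(parse_star_rating(None))
```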


In [30]:
# duplicatedrows = boutiques_info_df[boutiques_info_df.duplicated()]
# print(duplicatedrows)

#drop all rows with duplicate boutique_names, which are mostly ads:
boutiques_info_df_dropped = boutiques_info_df.drop_duplicates(subset="boutique_name", keep=False)

#reset the index of the de-duplicated dataframe
boutiques_info_df_dropped_reset = boutiques_info_df_dropped.reset_index(drop=True)

boutiques_info_df_dropped_reset.head(50)

Unnamed: 0,boutique_name,rating,number_reviews,price_range,phone_number,address,url_address
0,Kisha Studio Fashion Boutique,5 star rating,120.0,$$,(415) 422-0468,210 Clement St,https://www.yelp.com/biz/kisha-studio-fashion-boutique-san-francisco-2?osq=Clothing+Boutiques
1,Morning Lavender,4 star rating,28.0,$$,(650) 797-0686,1846 Union St,https://www.yelp.com/biz/morning-lavender-san-francisco-2?osq=Clothing+Boutiques
2,Onyx,4.5 star rating,17.0,$$,(415) 431-6699,289 Divisadero St,https://www.yelp.com/biz/onyx-san-francisco-3?osq=Clothing+Boutiques
3,Wild Feather,5 star rating,29.0,$$$,(415) 786-2614,597 Haight St,https://www.yelp.com/biz/wild-feather-san-francisco?osq=Clothing+Boutiques
4,Asmbly Hall - formerly on Fillmore Street,4.5 star rating,26.0,$$,(415) 801-5862,624 Divisadero St,https://www.yelp.com/biz/asmbly-hall-formerly-on-fillmore-street-san-francisco?osq=Clothing+Boutiques
5,Two Birds,4 star rating,24.0,$$$,(415) 285-1840,1309 Castro St,https://www.yelp.com/biz/two-birds-san-francisco-2?osq=Clothing+Boutiques
6,Gravel & Gold,4.5 star rating,32.0,$$$,(415) 552-0112,3266 21st St,https://www.yelp.com/biz/gravel-and-gold-san-francisco?osq=Clothing+Boutiques
7,Current Clothing,5 star rating,23.0,$$,(415) 400-5517,1738 Union St,https://www.yelp.com/biz/current-clothing-san-francisco?osq=Clothing+Boutiques
8,ANOMIE,4.5 star rating,15.0,$$$,(415) 872-9943,2149 Union St,https://www.yelp.com/biz/anomie-san-francisco?osq=Clothing+Boutiques
9,Chococo,4.5 star rating,26.0,$$,(408) 888-7910,455 Clement St,https://www.yelp.com/biz/chococo-san-francisco-6?osq=Clothing+Boutiques
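
Note that `keep=False` drops every row whose name appears more than once, rather than keeping the first occurrence. A small sketch on made-up data (the names and ratings are invented) contrasts the two options:

```python
import pandas as pd

toy = pd.DataFrame({
    "boutique_name": ["Ad Shop", "Store A", "Ad Shop", "Store B"],
    "rating": ["4 star rating", "5 star rating", "4 star rating", "4.5 star rating"],
})

# keep=False removes all copies of a repeated name.
no_repeats = toy.drop_duplicates(subset="boutique_name", keep=False).reset_index(drop=True)

# keep="first" would retain one copy of each repeated name.
first_only = toy.drop_duplicates(subset="boutique_name", keep="first").reset_index(drop=True)

print(no_repeats["boutique_name"].tolist())
print(first_only["boutique_name"].tolist())
```

If a genuine boutique happens to appear twice in the search results, `keep=False` removes it entirely, so `keep="first"` may be the safer choice when repeats are not all ads.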


In [31]:
#save the dataframe as a csv file:
# boutiques_info_df_dropped_reset.to_csv("sf_wclothing_boutiques_info.csv")

In [53]:
#save the dataframe as a pickle file for further NLP analysis:
#Write data to file
with open('boutique_info.pickle','wb') as f:
    pickle.dump(boutiques_info_df_dropped_reset,f,pickle.HIGHEST_PROTOCOL)
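
To read the file back in a later notebook, `pickle.load` mirrors the `pickle.dump` call above. The sketch uses a plain list of dicts and a temporary file as stand-ins for the dataframe and `boutique_info.pickle`, so it runs without the scraped data.

```python
import os
import pickle
import tempfile

# Stand-in for the scraped dataframe (hypothetical record).
records = [{"boutique_name": "Example Boutique", "rating": "5 star rating"}]

path = os.path.join(tempfile.mkdtemp(), "example.pickle")

# Write, as in the cell above.
with open(path, "wb") as f:
    pickle.dump(records, f, pickle.HIGHEST_PROTOCOL)

# Read back; only unpickle files you created yourself.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == records)
```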

# step 2:
#### 2-1. Set the main attribute and class names for each required feature of **each boutique**'s reviews across all of its review pages:

In [22]:
# for each boutique in its yelp_url_page:
review_mains_class = "lemon--div__373c0__1mboc arrange-unit__373c0__o3tjT arrange-unit-grid-column--8__373c0__2dUx_ border-color--default__373c0__3-ifU"
# including:
review_reviews_class = "lemon--span__373c0__3997G raw__373c0__3rKqk"
review_dates_class = "lemon--span__373c0__3997G text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa-"
review_stars_class = "lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"

#### 2-2. Define a function to iterate over all the review pages of **each boutique** and extract all the information:

In [23]:
#retrieve the text, date and star rating of every review for each boutique:
#each page has 20 reviews
def retrieve_reviews_info(boutique_name, boutique_url_address, number_reviews):
    url = boutique_url_address  
    total_reviews=[]
    total_review_dates=[]
    total_review_stars=[]
    for i in range(0, int(number_reviews)//20 + 1):
        url_page = url + "&start={}".format(i*20)
        html_page = requests.get(url_page)
        soup = BeautifulSoup(html_page.content, 'html.parser')
        review_mains = soup.find_all("div",{"class":review_mains_class})
        
        for review_main in review_mains:
            try:
                total_reviews.append(review_main.find("span",{"class":review_reviews_class}).text)
            except:
                total_reviews.append(None)
            try:
                total_review_stars.append(review_main.find("span",{"class":review_stars_class}).div.get("aria-label"))
            except:
                total_review_stars.append(None)
            try:
                total_review_dates.append(review_main.find("span",{"class":review_dates_class}).text)
            except:
                total_review_dates.append(None)
    
    data = {"review":total_reviews,"star_rating":total_review_stars,"date":total_review_dates}
    columns = ["review","star_rating","date"]
    review_info = pd.DataFrame(data, columns=columns)
                
    return review_info.dropna(subset = ["star_rating","date","review"], how="any")
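
The page count `int(number_reviews)//20 + 1` requests one extra, empty page whenever the review count is an exact multiple of 20 (for example, 120 reviews lead to 7 requests instead of 6). A ceiling-based variant avoids that. This is a sketch that assumes 20 reviews per page, as the comment above the function states; `review_page_offsets` is an illustrative name.

```python
import math

def review_page_offsets(number_reviews, per_page=20):
    """Return the `start` offsets needed to cover all reviews."""
    pages = math.ceil(int(number_reviews) / per_page)
    return [page * per_page for page in range(pages)]

print(review_page_offsets(17))
print(review_page_offsets(120))
```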

In [24]:
#check the results for one store:
pd.options.display.max_colwidth = 500
boutique_name = 'Onyx'
boutique_url_address ="https://www.yelp.com/biz/onyx-san-francisco-3?osq=Clothing+Boutiques"
number_reviews = 17
#to get the results as a dataframe for all three attributes:
myreviews = retrieve_reviews_info(boutique_name, boutique_url_address, number_reviews)
myreviews

Unnamed: 0,review,star_rating,date
1,"The most magical store full of so many beautiful things. A ton of my favorite things are from Onyx. I constantly get compliments on the Molly M Designs wallet that I bought there. The owners are also super nice and helpful, it's always a treat to shop there.",5 star rating,4/2/2020
2,"Onyx is literally a gem on Divis!Quite a gem +'s:+ Fetching collection for both women and men! The clothing designs reflect someone who understands style without being terribly trendy. Quality of materials show that consumers don't want something from Forever 21.+ Jewelry counter is a gem in itself. Don't forget to browse. Very interesting designs that are unique and worth a look or try.+ Pleasant assistance and someone who has an eye! Don't have her name, but that day, the woman at Onyx mad...",4 star rating,1/1/2019
3,"Just had the same experience as Jessie described. This place seemed to be a total gem and I was loving everything until realizing the shop owner was blatantly ignoring me and my boyfriend, with only us 2 in the shop. I never write reviews but I could not believe how rude this woman was. We spent a good amount of time browsing. When I finally approached the counter to look at jewelry and pick up a card I'm thinking ok now she'll at least acknowledge me.She finally dares to acknowledge my exis...",1 star rating,9/10/2017
4,"Onyx is a beautiful boutique! My wife & love it ! It is filled with art (which rotates), clothing and jewelry. Many of the items are from hard to find designers. The owner, Barb, is just fabulous and friendly (as is her dog, Spike). It's a great addition to the neighborhood! Stop by for unique clothing and gifts.",5 star rating,9/3/2018
5,"A well curated, bright open boutique in the Mission.I wanted to slip into every dress in there! The fabrics are incredible. Some of the softest, lightest, and well made pieces.It is clear every item is picked out for a reason. I was so smitten with each and every cut, pleat, seam and silhouette. They had a particular dress on a mannequin I kept coming back to because I KNEW no matter who put it on it was going to hang beautifully and flatter every curve. I walked back to it about a dozen tim...",5 star rating,7/23/2017
6,"I live in the neighborhood and love this boutique as well as the ladies who run it. It has a lovely selection with a great variety of prices and styles. Quality clothes, shoes, jewelry and a small selection of skincare products. Buy small, buy local!Thanks Onyx!",5 star rating,3/6/2017
7,"Exquisite designers from San Francisco and around the world. The two owners, Shannon and Barb, have an incredible eye for clothing and accessories- love these women! Subtle aroma from a ""Bir Sur"" scented candle and classic rock on vinyl welcome you at the door. You won't be able to leave without your arms weighted down from new styles and gems!",5 star rating,9/5/2018
8,"From the same owners of Onyx in the Mission - comes a new Onyx location - conveniently located on boutique lined Divisadero.Onyx Boutique is full of well chosen garments and accessories. From hip Obey faux red leather jackets lined with casual grey hoodie material, to plush lined casual jackets that make you yearn to never take them off - it is brimming with stylish finds. The space is big and well laid out and will be perfect for hosting fun events and I'm excited to see what other wonderfu...",4 star rating,11/13/2012
9,"This boutique is a San Francisco gem. The clothing, shoes and jewelry are super urban chic. The owners really have their fingers on the pulse of cool SF style. It's an art and gift gallery also. I just love what they carry in this crunchy casual spot. Even the wallpaper is cool as are the nice friendly ladies who own it. Def worth a visiting often, they get new goodies constantly.",4 star rating,5/24/2017
10,"This is a great neighborhood boutique. I always find something, I especially love their jewelry. They carry really unique pieces from local/West Coast makers. I like that there is a range of prices in their clothing. While there are some very pricey things, there are usually things within reach too. I've also had great success with their sample sales..gotten some absolute steals!While they don't carry a huge selection of shoes, what they do have I really like. I've bought a couple great pa...",5 star rating,1/25/2016
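
The `date` column is stored as text in month/day/year form (for example `7/23/2017`). A minimal sketch for converting it to a date object with the standard library follows; `parse_review_date` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime

def parse_review_date(text):
    """Parse an 'M/D/YYYY' string into a datetime.date."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()

print(parse_review_date("4/2/2020").isoformat())
print(parse_review_date("11/13/2012").isoformat())
```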
