# Project: Web Scraping TripAdvisor restaurants located in Mexico.

***Goal***: The main goal of this project is to scrape the contents of TripAdvisor and create an interactive map in which the user can see how many restaurants are available in the country and get more information about a restaurant by clicking its marker.

In [None]:
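import os
import glob
import pandas as pd
import folium
import branca
from folium.plugins import MarkerCluster
from geopy.geocoders import GoogleV3
from geopy.exc import GeocoderUnavailable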

## 1. Scrape the data

The first step is to scrape the data. I used Scrapy to perform this task. With this library you define a spider that crawls the website and extracts the requested information using CSS or XPath selectors.

For TripAdvisor, we wanted to get:
* Name of the restaurant;
* Address;
* Mail;
* Phone number;
* Rating out of 5 stars;
* Number of reviews;
* Price range; and
* Type of cuisine.


In order to extract all possible data from TripAdvisor, we used a special type of spider called CrawlSpider, whose *rules* attribute specifies how to extract the links from a page and which callbacks should be called for those links. The rules are handled by the default parse() method implemented in that class.

For each of the 32 states of Mexico, I used the spider file *trip_states.py*, in which I specified:
* The pagination rule, so the spider could follow the result pages;
* The allowed domains for the spider (<www.tripadvisor.com.mx>);
* The starting URL from which the spider would begin scraping. As an example, for the state of Campeche, **[this was the starting URL that the spider used to perform the web scraping](https://www.tripadvisor.com.mx/Restaurants-g1632078-Campeche_Yucatan_Peninsula.html)**; and
* A parse function, which uses XPath selectors to select specific sections of the website and scrape the data.

After running the spider, the data was stored as a csv file, with the following columns:
* ***entity/city***: name of the state
* ***name***: name of the restaurant
* ***address***: address of the restaurant
* ***mail***: email
* ***phone_number***: phone number
* ***score***: score out of 5 stars
* ***no_reviews***: number of reviews that the restaurant has gotten
* ***ranking_subclass_1***, ***ranking_subclass_2***, and ***ranking_subclass_3***: the restaurant's ranking within its cuisine specialty in the state
* ***ranking_city_1***, ***ranking_city_2***, and ***ranking_city_3***: the overall city ranking
* ***price_range***: the average price range of the bill
* ***food_type***: the cuisine specialty

I ended up with 32 csv files, one for each state. ![Here is an example of the resulting csv](csv_example.PNG "Example csv")

## 2. Cleaning the data

Here, the goal was to clean the csv files and combine them into one single csv file.

#### Combining the csv files into one single csv

In [None]:
#change directory to the folder in which all csv files are located
os.chdir('D:/Users/rsilva/Documents/Python Scripts/webscrap/projects/tripadvisor_states/scraped_files/')

extension = 'csv' #file extension that glob is going to look for
all_filenames = glob.glob('*.{}'.format(extension)) #file names of all csv files located in the folder

#concatenate all csv files into one single DataFrame, dropping duplicates
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
combined_csv = combined_csv.drop_duplicates()
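
The same combining step can also be written without changing the working directory. The sketch below is one possible variant, not the code used for this project: the function name `combine_csvs` is hypothetical, and it assumes all files in the folder share the same columns.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(folder):
    """Read every *.csv file in `folder` and return one de-duplicated DataFrame."""
    paths = sorted(Path(folder).glob("*.csv"))
    if not paths:
        # Fail early instead of letting pd.concat raise on an empty list.
        raise FileNotFoundError(f"no csv files found in {folder!r}")
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)
```

Calling `combine_csvs('scraped_files')` on a folder of per-state files would return a single DataFrame with a clean 0..n-1 index, which makes later positional lookups easier to reason about.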

#### Cleaning the data

To clean the data, I created a function, ***clean_df***, that does the following tasks:
* For the *mail* column, it removes the string "mailto:" and the trailing "?subject=?"
* For the phone numbers, it removes the string "tel:"
* For the ranking columns, it removes the "N.º" prefix and the " de " separator

In [None]:
def clean_df(df):
    '''
    Remove special characters and unwanted words from the scraped data.
    '''
    df = df.copy() #work on a copy so the input DataFrame is not modified

    #regex=False treats each pattern as literal text ('?' and '.' are regex metacharacters)
    df['mail'] = df['mail'].str.replace('mailto:', '', regex=False).str.replace('?subject=?', '', regex=False)
    df['phone_number'] = df['phone_number'].str.replace('tel:', '', regex=False)
    df['ranking_subclass_1'] = df['ranking_subclass_1'].str.replace('N.º\xa0', '', regex=False)
    df['ranking_subclass_2'] = df['ranking_subclass_2'].str.replace(' de ', '', regex=False)
    df['ranking_city_1'] = df['ranking_city_1'].str.replace('N.º\xa0', '', regex=False)
    df['ranking_city_2'] = df['ranking_city_2'].str.replace(' de ', '', regex=False)

    return df
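
To make the effect of these replacements concrete, here is a small self-contained sketch that applies the same literal substitutions to one made-up row. The raw values are assumptions inferred from the strings being removed, not actual scraped data, and the `to_strip` dictionary is just a compact way to write the same steps.

```python
import pandas as pd

# Made-up raw values that mimic the prefixes the cleaning step removes.
raw = pd.DataFrame({
    "mail": ["mailto:info@example.com?subject=?"],
    "phone_number": ["tel:+52 999 000 0000"],
    "ranking_city_1": ["N.º\xa03"],
    "ranking_city_2": [" de 250"],
})

# Literal (non-regex) tokens to strip from each column.
to_strip = {
    "mail": ["mailto:", "?subject=?"],
    "phone_number": ["tel:"],
    "ranking_city_1": ["N.º\xa0"],
    "ranking_city_2": [" de "],
}

cleaned = raw.copy()
for column, tokens in to_strip.items():
    for token in tokens:
        cleaned[column] = cleaned[column].str.replace(token, "", regex=False)

print(cleaned.iloc[0].to_dict())
```

For this row the cleaned values are `info@example.com`, `+52 999 000 0000`, `3`, and `250`.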


## 3. Getting the coordinates for each restaurant scraped

One crucial step to map the restaurants is to get the coordinates. One way to obtain these coordinates is to use the library ***geopy***. Geopy locates coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders.

In this project, I used geopy's **[GoogleV3 geocoder](https://geopy.readthedocs.io/en/stable/#googlev3)**, which requires a Google API key. See <https://developers.google.com/maps/documentation/geolocation/get-api-key> for how to create one.

In [None]:
def get_coordinates(df):
    '''
    Get latitude and longitude for each address using geopy and the Google geocoder.
    '''
    geolocator = GoogleV3(api_key='INSERT VALID API KEY HERE') #must insert a valid Google API key

    latitudes = []
    longitudes = []
    for address in df['address'].tolist():
        try:
            location = geolocator.geocode(address)
        except GeocoderUnavailable as e:
            print(f"Geocoder unavailable for {address!r}: {e}")
            location = None
        else:
            if location is None:
                print(f"Could not find Location for {address!r}")

        #missing coordinates are stored as None, which pandas turns into NaN
        latitudes.append(location.latitude if location is not None else None)
        longitudes.append(location.longitude if location is not None else None)

    df = df.assign(lat = latitudes)
    df = df.assign(lon = longitudes)

    return df

#clean the combined data, add the coordinates, and export to csv
df = get_coordinates(clean_df(combined_csv))
df.to_csv('tripadvisor_restaurants.csv', index = False, encoding='utf-8-sig')
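
Since every geocoding request costs time (and possibly money), one refinement is to geocode each distinct address only once. The sketch below is a hypothetical variant, not the function used above: `add_coordinates` and its `geocode` parameter are names I introduced, and passing the geocoding callable in as an argument is a design choice that lets the logic be tried with a stub instead of a real API key.

```python
import pandas as pd


def add_coordinates(df, geocode):
    """Return a copy of df with lat/lon columns, calling geocode once per distinct address.

    `geocode` is any callable that takes an address and returns an object with
    .latitude and .longitude attributes, or None when nothing is found.
    """
    cache = {}
    for address in df["address"].dropna().unique():
        try:
            location = geocode(address)
        except Exception as exc:  # broad on purpose: the sketch is geocoder-agnostic
            print(f"Geocoding failed for {address!r}: {exc}")
            location = None
        if location is not None:
            cache[address] = (location.latitude, location.longitude)
        else:
            cache[address] = (None, None)

    # Rows with a missing address fall back to (None, None).
    coords = [cache.get(address, (None, None)) for address in df["address"]]
    return df.assign(lat=[c[0] for c in coords], lon=[c[1] for c in coords])
```

With geopy this could be called as `add_coordinates(df, geolocator.geocode)`. geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` that can wrap the `geocode` callable to space out requests.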

## 4. Map the data!

The final step is to create the map. I used the **[Folium MarkerCluster plugin](https://python-visualization.github.io/folium/plugins.html#folium.plugins.MarkerCluster)**, which enables us to cluster several markers together and make the map more visually appealing.

In [None]:
df = pd.read_csv('D:/Users/rsilva/Documents/Python Scripts/webscrap/projects/tripadvisor/tripadvisor_restaurants_final.csv', encoding = 'utf-8-sig')
df['name'] = df['name'].replace(r'[^\w\s]|_', '', regex=True)

df['name'] = df['name'].astype(pd.StringDtype())

print(df.dtypes)
print(df.count())
#filter so restaurants shown are in Mexico only (rows with missing coordinates are dropped too)
df = df[df['lon'].between(-118.5, -86.3)]
df = df[df['lat'].between(14.2, 33.3)]

#create map object centered at the mean of the remaining coordinates
m = folium.Map(location=df[["lat", "lon"]].mean().to_list(), zoom_start=5)


#create popup function to customize popup
def popup_html(i):
    #i is the positional (iloc) index of the restaurant's row in df
    restaurant_name = df['name'].iloc[i]
    restaurant_mail = df['mail'].iloc[i]
    restaurant_phone = df['phone_number'].iloc[i]
    reviews = df['no_reviews'].iloc[i]
    score = df['score'].iloc[i]
    food_types = df['food_type'].iloc[i]

    left_col_color = "#0063B2FF"
    right_col_color = "#9CC3D5FF"

    html = """<!DOCTYPE html>
    <html>
    <h4 style="margin-bottom: 10px; width: 200px;">{}</h4>""".format(restaurant_name) + """
        <table style="height: 126px; width: 350px;">
    <tbody>
    <tr>
    <td style="background-color: """ + left_col_color + """;"><span style="color: #ffffff;">Phone Number</span></td>
    <td style="width: 150px;background-color: """ + right_col_color + """;">{}</td>""".format(restaurant_phone) + """
    </tr>
    <tr>
    <td style="background-color: """ + left_col_color + """;"><span style="color: #ffffff;">Contact E-mail</span></td>
    <td style="width: 150px;background-color: """ + right_col_color + """;">{}</td>""".format(restaurant_mail) + """
    </tr>
    <tr>
    <td style="background-color: """ + left_col_color + """;"><span style="color: #ffffff;">Number of reviews</span></td>
    <td style="width: 150px;background-color: """ + right_col_color + """;">{}</td>""".format(reviews) + """
    </tr>
    <tr>
    <td style="background-color: """ + left_col_color + """;"><span style="color: #ffffff;">Score</span></td>
    <td style="width: 150px;background-color: """ + right_col_color + """;">{}</td>""".format(score) + """
    </tr>
    <tr>
    <td style="background-color: """ + left_col_color + """;"><span style="color: #ffffff;">Type of Food</span></td>
    <td style="width: 150px;background-color: """ + right_col_color + """;">{}</td>""".format(food_types) + """
    </tr>
    </tbody>
    </table>
    </html>
    """
    return html




marker_cluster = MarkerCluster().add_to(m)

for i in range(len(df)):
    location = (df["lat"].iloc[i], df["lon"].iloc[i])
    html = popup_html(i)
    popup = folium.Popup(folium.Html(html, script=True), max_width=500)
    folium.Marker(location=location, popup=popup).add_to(marker_cluster)




m.save('tripadvisor_restaurants_mexico2.html')

And that's a wrap! [Here's the resulting TripAdvisor map](tripadvisor_restaurants_mexico2.html)

![Here is an example of the map 1](map01.PNG "Map")

As you zoom into the map, the clusters change size ![Here is an example of the map](map02.PNG "Map zoomed in")

As you can see, the map clusters several restaurants that are located nearby ![Here is an example of the map](map03.PNG "Map zoomed in")

Finally, when you click one of the markers, you can see more details about the restaurant ![Here is an example of the map](map04.PNG "Restaurant details")