# **Trully challenge for Jr. Data Analysts.**


For this challenge you will need to design an end-to-end solution to do the
following:
- Scrape information from TripAdvisor Mexico (www.tripadvisor.com.mx).
  We’re looking for information about restaurants in Mexico: latitude
  and longitude, number of reviews, ranking, etc. Get as much information
  as you can.
- Clean and treat the data.
- Have fun and go through a data mining process to generate new variables
  from the scraped data.
- Which insights did you find most interesting in the data?
- Generate deliverables for the data engineering team: Python scripts and
  CSV files.

In [42]:
#import libraries
#!sudo pip install requests-html  nums_from_string
from requests_html import HTMLSession
from bs4 import BeautifulSoup, Tag
import pandas as pd
import re
import nums_from_string
import base64

## Introduction
The proposed solution uses the BeautifulSoup library
as the main tool.

To obtain the information on Mexican restaurants from the site (www.tripadvisor.com.mx), the following steps were carried out, following the site's search structure:

- The cities available on the main page were extracted; for each city we obtained the name, the popularity ranking, and the URL.

![main](Images/main-page.png)

- For each city, the list of its restaurants was obtained, again with the name, the popularity ranking, and the URL.

![second](Images/second-page.png)

- Finally, the detailed information was obtained from each restaurant page, for example: name, phone number, website, rating, number of reviews, overall ranking, location (latitude, longitude), type of food, and the ratings for food, service, and price/quality.

![restaurant](Images/restaurant-pege.png)

# First part

In [43]:
#start session
s=HTMLSession()
inicial_url='https://www.tripadvisor.com.mx'

In [44]:
def get_data(url:str):
  """Fetch a page and parse its HTML.

  Parameters
  ----------
  url : str
      URL of the page to extract data from

  Returns
  -------
  BeautifulSoup
      Parsed HTML of the page
  """
  r=s.get(url=url)
  soup=BeautifulSoup(r.content,'html.parser')
  return soup
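
`get_data` issues a single request with no timeout, status check, or retry. Below is a minimal sketch of a more defensive fetch, assuming a session object whose `get(url, timeout=...)` returns a response with `status_code` and `content` attributes, as `requests`-style sessions do. The names `fetch_content`, `retries`, and `delay` are hypothetical and are not part of the notebook.

```python
import time


def fetch_content(session, url, retries=3, delay=1.0, timeout=30):
    """Return the response body for `url`, retrying on errors (hypothetical helper).

    `retries` is the total number of attempts and should be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=timeout)
            if response.status_code == 200:
                return response.content
            last_error = RuntimeError(f"HTTP {response.status_code} for {url}")
        except OSError as exc:  # requests' network exceptions derive from OSError
            last_error = exc
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

With the notebook's session `s`, `BeautifulSoup(fetch_content(s, url), 'html.parser')` could then take the place of the body of `get_data`.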

In [45]:
def get_newpage(soup,i:int):
  """Get the link to a given page from the pagination bar.

  Parameters
  ----------
  soup : BeautifulSoup
      Parsed page from which the pagination is read
  i : int
      Number of the page to get

  Returns
  -------
  str
      Relative link (href) to page i

  Raises AttributeError when the page has no link to page i.
  """

  page=soup.find('a', attrs={'data-page-number':str(i)}).get('href') 
  return page

In [46]:
def get_cities_list(soup, cities_catalog):
  """Append the cities listed on a page to the catalog.

  Parameters
  ----------
  soup : BeautifulSoup
      Parsed listing page
  cities_catalog : list
      List to which one dictionary per city is appended

  Notes
  -----
  The popularity rank is computed from the global variable `page` (the
  offset of the current listing page), which the calling loop must set.
  Raises AttributeError when the page has no 'geoList' element.
  """

  cities=soup.find('ul', class_='geoList')
  cities2=cities.find_all('li')

  for j,element in enumerate(cities2):
    #drop the fixed 16-character prefix that precedes the city name
    name=element.find('a').text[16:]
    link='https://www.tripadvisor.com.mx'+element.find('a').get('href')

    cities_info={
        'popularity':page+j+1,
        'name':name,
        'url':link
    }
    cities_catalog.append(cities_info)

In [47]:
url='https://www.tripadvisor.com.mx/Restaurants-g150768-oa00-Mexico.html#LOCATION_LIST'
soup=get_data(url)
cities=soup.find_all('div', class_='geo_entry')

#first page
cities_catalog=[]
for i,element in enumerate(cities):
  name=element.find('div', class_='geo_name').text[17:-1]
  link='https://www.tripadvisor.com.mx'+element.find('a').get('href')
  cities_info={
      'popularity':i+1,
      'name':name,
      'url':link
  }
  cities_catalog.append(cities_info)

In [48]:
url='https://www.tripadvisor.com.mx/Restaurants-g150768-oa20-Mexico.html#LOCATION_LIST'
soup=get_data(url)
#cities_catalog=[]

#obtain the information of all the cities on the website
i=1
while True:
            
  page=20*i
  url='https://www.tripadvisor.com.mx/Restaurants-g150768-oa'+str(page)+'-Mexico.html#LOCATION_LIST'
  soup=get_data(url)
  
  try:
    get_cities_list(soup, cities_catalog)
  except AttributeError:
    break

  i+=1
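
The loop above builds each listing URL from an offset that grows in steps of 20 and stops when a page has no `geoList` element. The URL construction can be sketched on its own as below; `city_list_url`, `iter_city_list_urls`, and the `max_pages` safety bound are hypothetical names added for illustration, not part of the notebook.

```python
from itertools import count, islice

BASE_URL = 'https://www.tripadvisor.com.mx'
PAGE_SIZE = 20  # step used by the notebook's loop (page = 20 * i)


def city_list_url(offset):
    """Build the listing URL for a given offset, following the pattern used above."""
    return f'{BASE_URL}/Restaurants-g150768-oa{offset}-Mexico.html#LOCATION_LIST'


def iter_city_list_urls(start_page=1, max_pages=None):
    """Return an iterator over listing URLs for pages start_page, start_page + 1, ...

    max_pages is an optional safety bound (an assumption, not in the notebook);
    with max_pages=None the iterator is unbounded and the caller must stop it.
    """
    urls = (city_list_url(PAGE_SIZE * i) for i in count(start_page))
    return islice(urls, max_pages)
```

A bound such as `max_pages` guards against an endless loop if the site ever answers an out-of-range offset with a valid listing page instead of one that lacks `geoList`.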

## Results

In [49]:
print('There are {} cities with restaurants on the website'.format(len(cities_catalog)))

There are 1063 cities with restaurants on the website


In [50]:
#example of city information
cities_catalog[0]

{'popularity': 1,
 'name': 'Ciudad de México',
 'url': 'https://www.tripadvisor.com.mx/Restaurants-g150800-Mexico_City_Central_Mexico_and_Gulf_Coast.html'}

Save the catalog to a CSV file and keep it as a DataFrame for further work.

In [51]:
df_cities_catalog=pd.DataFrame(cities_catalog)
df_cities_catalog.to_csv('D:/DIEGO/Documents/CODIGO/challenge/Challenge repository/Data/cities_catalog.csv')

In [52]:
df_cities_catalog

| | popularity | name | url |
|---|---|---|---|
| 0 | 1 | Ciudad de México | https://www.tripadvisor.com.mx/Restaurants-g15... |
| 1 | 2 | Guadalajara | https://www.tripadvisor.com.mx/Restaurants-g15... |
| 2 | 3 | Playa del Carmen | https://www.tripadvisor.com.mx/Restaurants-g15... |
| 3 | 4 | Cancún | https://www.tripadvisor.com.mx/Restaurants-g15... |
| 4 | 5 | Monterrey | https://www.tripadvisor.com.mx/Restaurants-g15... |
| ... | ... | ... | ... |
| 1058 | 1059 | Atotonilco | https://www.tripadvisor.com.mx/Restaurants-g16... |
| 1059 | 1060 | Zinacantán | https://www.tripadvisor.com.mx/Restaurants-g10... |
| 1060 | 1061 | Municipio de Vista Hermosa | https://www.tripadvisor.com.mx/Restaurants-g14... |
| 1061 | 1062 | Purépero | https://www.tripadvisor.com.mx/Restaurants-g31... |


# Second part

For the following parts, only the restaurants in Mexico City were extracted.

This is because of the large amount of data; however, a loop over the city catalog could extract the restaurants of every city.
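
A minimal sketch of such a loop follows. It assumes a callable `scrape_city` that takes a city URL and returns a list of restaurant dictionaries, for example a wrapper around the pagination loop used for Mexico City; `crawl_all_cities` and `scrape_city` are hypothetical names, not functions defined in this notebook.

```python
def crawl_all_cities(cities, scrape_city):
    """Collect restaurant records for every city in `cities`.

    cities      : iterable of dicts with 'name' and 'url' keys (as in cities_catalog)
    scrape_city : callable taking a city URL and returning a list of restaurant
                  dicts; hypothetical, e.g. a wrapper around the pagination loop
    """
    all_restaurants = []
    failed = []
    for city in cities:
        try:
            records = scrape_city(city['url'])
        except Exception as exc:  # keep going if one city fails; note it for a retry
            failed.append((city['name'], repr(exc)))
            continue
        for record in records:
            # Copy so the caller's dicts are not modified, and tag with the city
            all_restaurants.append({**record, 'city': city['name']})
    return all_restaurants, failed
```

Returning the failed cities separately lets a long crawl finish and be retried only where needed, rather than stopping at the first error.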


In [53]:
#Mexico City restaurants page
df_cities_catalog.url[0]
city=df_cities_catalog.name[0]

In [54]:
def get_restaurants(soup,restaurant_directory):
  """Append the restaurants listed on a page to the directory.

  Parameters
  ----------
  soup : BeautifulSoup
      Parsed listing page
  restaurant_directory : list
      List in which the data is stored; one dictionary per restaurant
      (popularity, name, url) is appended
  """
  restaurants=soup.find_all('div', attrs={'class': re.compile('YHnoF Gi o'),'data-test': re.compile(r'[0-9]_list_item') })

  for restaurant in restaurants:

    information=restaurant.find('div', class_='RfBGI').find('a')

    #the link text holds the rank followed by the name
    popularity=nums_from_string.get_nums(information.text)[0]
    name=re.findall(r'[a-zA-Z].*',information.text)[0]
    link=inicial_url+information.get('href')

    restaurant_info={
        'popularity':popularity,
        'name':name,
        'url':link
    }

    restaurant_directory.append(restaurant_info)
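
The function above takes the first number in the link text as the rank and everything from the first ASCII letter onward as the name, so a name that starts with a digit or an accented letter would lose its first characters. As an alternative sketch using only `re` instead of `nums_from_string`, and assuming the link text has the shape `<rank>. <name>` (an assumption inferred from the code, not checked against the site), the two parts could be split as follows; `split_rank_and_name` is a hypothetical helper.

```python
import re

# Assumed link-text shape: "<rank>. <name>", e.g. "1. La Mansion Marriott Reforma"
RANK_NAME = re.compile(r'^\s*(\d+)\s*\.\s*(.+?)\s*$')


def split_rank_and_name(text):
    """Return (rank, name); rank is None when the text has no leading number."""
    match = RANK_NAME.match(text)
    if match is None:
        return None, text.strip()
    return int(match.group(1)), match.group(2)
```

Entries without a leading rank come back with `None` as the rank instead of raising an `IndexError`.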
               

In [55]:
url=df_cities_catalog.url[0] #Mexico City restaurants page
page=get_data(url)
restaurant_directory=[]

#collect the restaurants from every page, following the pagination links
i=2 #number of the next page to request
while True:
  page=get_data(url)
  get_restaurants(page,restaurant_directory)

  try:
    new_page=get_newpage(page,i)
  except AttributeError:
    break
  url=inicial_url+new_page
  i+=1

## Results

In [27]:
print('There are {} restaurants in {}'.format(len(restaurant_directory),city))

There are 6730 restaurants in Ciudad de México


In [28]:
#example of restaurant directory information
restaurant_directory[0]

{'popularity': 1,
 'name': 'La Mansion Marriott Reforma',
 'url': 'https://www.tripadvisor.com.mx/Restaurant_Review-g150800-d2394477-Reviews-La_Mansion_Marriott_Reforma_Steakhouse-Mexico_City_Central_Mexico_and_Gulf_Coast.html'}

Save the restaurant directory to a CSV file and keep it as a DataFrame for further work.

In [29]:
df_restaurant_directory=pd.DataFrame(restaurant_directory)
df_restaurant_directory.to_csv('D:/DIEGO/Documents/CODIGO/challenge/Challenge repository/Data/restaurant_directory.csv')

# Third part: Restaurant page information

In [30]:
def get_restaurant_information(soup, name, popularity, restaurant_dir):
  """Extract the information of a restaurant from its page.

  Parameters
  ----------
  soup : BeautifulSoup
      Parsed restaurant page
  name : str
      Name of the restaurant, from the restaurant directory
  popularity : int
      Popularity rank of the restaurant, from the restaurant directory
  restaurant_dir : list
      List to which the dictionary with the restaurant information is appended
  """

  try: #overall rating
    cal=float(soup.find('span', attrs={'class':'ZDEqb'}).text)
  except AttributeError:
    cal=None
  
  try: #number of reviews
    num_op=nums_from_string.get_nums(soup.find('span', attrs={'class':'reviews_header_count'}).text)[0]
  except AttributeError:
    num_op=None
  

  restaurant_inf={
      'Popularity':popularity,
      'Name':name,
      'Calfication':cal,
      'Num_reviwes':num_op
  }
  
  try: #phone
    phone=soup.find('a', attrs={'class':'BMQDV _F G- wSSLS SwZTJ','href': re.compile(r'tel.*')}).text
    restaurant_inf['Phone']=phone
  except AttributeError:
    pass

  try: #website
    web=soup.find('a', attrs={'class':'YnKZo Ci Wc _S C AYHFM'}).get('data-encoded-url')
    web=(base64.b64decode(web))
    web=re.findall(r'http.+\_',str(web))[0][:-1]
    restaurant_inf['Webside']=web
  except (IndexError, AttributeError):
    pass

  try: #location 
    loc=soup.find('a', attrs={'class':'YnKZo Ci Wc _S C FPPgD'}).get('data-encoded-url')
    loc=re.findall(r'\d+.\d+,.\d+.\d+',str(base64.b64decode(loc)))[0]
    loc=re.findall(r'.\d+.\d+', loc)

    lat=nums_from_string.get_nums(loc[0])[0]
    lng=nums_from_string.get_nums(loc[1])[0]

    restaurant_inf['Latitude']=lat
    restaurant_inf['Longitude']=lng
    
  except (IndexError, AttributeError): #no location link, or no coordinates in it
    pass

  try: #type of food
    food=soup.find('div', attrs={'class':'tbUiL b'}, text='TIPOS DE COMIDA')
    type_food=food.find_next().text
    restaurant_inf['Type_food']=type_food
  except AttributeError:
    pass

  califications=soup.find_all('div', attrs={'class':'DzMcu'})

  for i in (califications): #rating per category (food, service, ...)
    #the second CSS class of the inner span encodes the score, e.g. 45 for 4.5
    xd=i.find('span', class_='vzATR').find('span').get('class')[1]
    number=int(re.findall(r'\d+', xd)[0])/10
    restaurant_inf[i.text]=number

  
  restaurant_dir.append(restaurant_inf)
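
Two of the steps above are easier to follow in isolation: the location is read from a base64-encoded `data-encoded-url` attribute, and each category rating is read from a CSS class name. The sketch below reproduces both with the standard library only. The payload layout and the `bubble_45` class name are assumptions made for illustration, and `decode_coordinates` and `bubble_class_to_rating` are hypothetical helpers, not the notebook's code.

```python
import base64
import re

COORDS = re.compile(r'(-?\d+\.\d+),\s*(-?\d+\.\d+)')
BUBBLE = re.compile(r'bubble_(\d+)$')  # assumed class-name pattern


def decode_coordinates(encoded):
    """Decode a base64 attribute value and return (lat, lng), or None if absent."""
    decoded = base64.b64decode(encoded).decode('utf-8', errors='replace')
    match = COORDS.search(decoded)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def bubble_class_to_rating(css_class):
    """Map a class such as 'bubble_45' to 4.5; return None for other classes."""
    match = BUBBLE.search(css_class)
    return int(match.group(1)) / 10 if match else None
```

Unlike the two-step regular expression in the function above, `COORDS` escapes the decimal point and keeps the sign, so a negative latitude would also be parsed.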

Get the information of every restaurant in the restaurant directory.

In [None]:
restaurant_dir=[]

In [35]:
for i in range(len(restaurant_dir),len(df_restaurant_directory)):

  popularity=df_restaurant_directory.popularity[i]
  url=df_restaurant_directory.url[i]
  name=df_restaurant_directory.name[i]

  soup=get_data(url)
  get_restaurant_information(soup, name, popularity,restaurant_dir)
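
Because the loop starts at `len(restaurant_dir)`, re-running the cell after an interruption (a network error, for example) resumes at the first restaurant that has not been stored yet. The sketch below makes that resume logic explicit and adds an optional pause and periodic checkpoint. It assumes a callable `scrape_one` that returns the record instead of appending it; `scrape_with_resume`, `scrape_one`, `checkpoint`, `pause`, and `every` are hypothetical names.

```python
import time


def scrape_with_resume(rows, results, scrape_one, pause=0.0,
                       checkpoint=None, every=100):
    """Process rows[len(results):], appending one record per row to `results`.

    rows       : sequence of (popularity, name, url) tuples
    results    : list of records gathered so far; its length is the resume point
    scrape_one : callable (popularity, name, url) -> dict (hypothetical)
    checkpoint : optional callable(results), called every `every` records
    """
    for index in range(len(results), len(rows)):
        popularity, name, url = rows[index]
        results.append(scrape_one(popularity, name, url))
        if checkpoint is not None and len(results) % every == 0:
            checkpoint(results)
        if pause:
            time.sleep(pause)  # be gentle with the server between requests
    return results
```

A `checkpoint` that writes the partial list to disk would keep the work done so far even if the kernel is restarted.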

## Results

In [38]:
print('There is information for {} restaurants'.format(len(restaurant_dir)))

There is information for 6730 restaurants


Save the restaurant information to a CSV file.

In [39]:
df_restaurant_information=pd.DataFrame(restaurant_dir)
df_restaurant_information.to_csv('D:/DIEGO/Documents/CODIGO/challenge/Challenge repository/Data/restaurant_information.csv')

In [37]:
df_restaurant_information

| | Popularity | Name | Calfication | Num_reviwes | Phone | Webside | Latitude | Longitude | Type_food | Comida | Servicio | Calidad/precio | Ambiente |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | La Mansion Marriott Reforma | 5.0 | 1086.0 | +52 55 1102 7021 | http://www.facebook.com/mansionmarriottreforma | 19.428345 | -99.164260 | Mexicana, Parrillada | 4.5 | 4.5 | 4.5 | 4.0 |
| 1 | 2 | Balta | 5.0 | 546.0 | +52 55 8660 0500 | http://www.sofitel-mexico-city.com/restaurants... | 19.428432 | -99.165920 | Mexicana, Mediterránea, Europea, Parrilla | 4.5 | 5.0 | 4.5 | |
| 2 | 3 | Condimento Restaurant | 5.0 | 757.0 | +52 55 1102 7030 | http://www.marriott.com/hotels/hotel-informati... | 19.427828 | -99.164024 | Internacional | 4.5 | 4.5 | 4.5 | 4.0 |
| 3 | 4 | La Distral | 5.0 | 393.0 | +52 55 5140 4100 | http://www.fiestamericana.com/hoteles-y-resort... | 19.433002 | -99.154580 | Mexicana, Latina, Internacional | 4.5 | 4.5 | 4.5 | |
| 4 | 5 | Taquería y Restaurante Takotl | 5.0 | 167.0 | +52 55 4778 6428 | http://linkreview.biz/mx/takotl | 19.416248 | -99.165460 | Mexicana, Internacional, Bar, Pub | 5.0 | 5.0 | 5.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6725 | 6726 | Santa #cocinacoqueta | | | +52 55 1690 6948 | | 19.420197 | -99.174260 | Mexicana, Saludable | | | | |
| 6726 | 6727 | Momoxco | | | | | 19.217780 | -99.049440 | Mexicana | | | | |
| 6727 | 6728 | La Strada | | | +52 55 9130 1660 | http://www.lastradadf.com | 19.390265 | -99.043290 | Italiana | | | | |
| 6728 | 6729 | BP | | | | https://www.bp.com/es_mx/mexico/home.html | 19.337640 | -99.119210 | | | | | |
