# Data Scraping with Python

- HTML Structure: Understand the HTML structure of the website you're scraping. Use browser developer tools to inspect the elements you need to extract data from. Identify unique identifiers such as class names, IDs, or XPath expressions that can help you locate and extract the relevant information.
- Robust Scraping Tools: Python offers several libraries for web scraping, including BeautifulSoup, Scrapy, and Selenium. Choose the one that best fits your needs based on the complexity of the website and the level of interactivity required for scraping (e.g., dynamic content loaded via JavaScript).
- Handling Dynamic Content: Some websites load content dynamically with JavaScript, so the data is missing from the initial HTML response and cannot be scraped with a plain HTTP request. In such cases, consider a browser automation tool such as Selenium, which can drive a headless browser, or a JavaScript rendering service such as Splash.
- Data Extraction: Once you've identified the elements containing the data you need, use the scraping library to extract the information. This may involve parsing HTML, handling nested structures, and cleaning the extracted data to ensure consistency and accuracy.
- Monitoring and Maintenance: Regularly monitor your scraping process for errors, changes in website structure, or potential violations of terms of service. Update your scraping code as needed to adapt to any changes and maintain compliance with the website's policies.
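
As a small illustration of the monitoring point, one option is to check each scraped record for empty fields, since a selector that suddenly returns nothing often signals a change in the page layout. This is a minimal sketch, not a complete monitoring setup: `missing_fields` is a hypothetical helper, and the field names are assumed to match the record built later in this notebook.

```python
def missing_fields(record, required=("title", "description", "image_links")):
    """Return the names of required fields that are absent or empty."""
    return [name for name in required if not record.get(name)]


# Hypothetical record in which two selectors stopped matching
record = {"title": "Maia Centro", "description": None, "image_links": []}

problems = missing_fields(record)
if problems:
    print("Possible layout change, missing fields:", problems)
```

Running a check like this after every scrape makes silent failures visible before they reach downstream analysis.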

In [1]:
import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = "https://visitmaia.pt/produtos-turisticos/maia-centro"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

# Send a GET request to the URL (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, headers=headers, timeout=10)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the webpage
    soup = BeautifulSoup(response.content, "html.parser")

    # Find meta tags with property="og:title" and property="og:description"
    og_title_meta = soup.find("meta", property="og:title")
    og_description_meta = soup.find("meta", property="og:description")
    
    # Find all meta tags with property="og:image"
    og_images_meta = soup.find_all("meta", property="og:image")

    # Extract content from meta tags
    og_title = og_title_meta["content"] if og_title_meta else None
    og_description = og_description_meta["content"] if og_description_meta else None
    
    # Extract URLs from all meta tags with property="og:image"
    # (find_all returns an empty list when nothing matches, so this is always a list)
    og_images = [img["content"] for img in og_images_meta]

    # Find all img tags with class="u-back-image u-expanded" and src starting with
    # "https://visitmaia.pt/assets/images/attractions/" (img tags without src are skipped)
    img_tags = soup.find_all(
        "img",
        class_="u-back-image u-expanded",
        src=lambda x: x is not None and x.startswith("https://visitmaia.pt/assets/images/attractions/"),
    )
    
    # Extract the src attribute value from each img tag
    image_links = [img["src"] for img in img_tags]

    image_links = og_images + image_links

    print("og:title: ", og_title)
    print("og:description: ", og_description)
    print("og:image_links: ", image_links)

else:
    print("Failed to retrieve webpage:", response.status_code)


og:title:  Maia Centro
og:description:  Situada na Maia, a 20 km da P&oacute;voa de Varzim, a Maia Centro disponibiliza um jardim e acesso Wi-Fi gratuito.
Est&atilde;o dispon&iacute;veis op&ccedil;&otilde;es de pequeno-almo&ccedil;o continentais e &agrave; carta, todas as manh&atilde;s no apartamento.
Disp&otilde;e ainda de um terra&ccedil;o onde se encontram todas as comodidades para churrascos no local.
&Eacute; poss&iacute;vel fazer passeios de bicicleta e caminhadas perto do alojamento.
O Porto fica a 10 km da Maia Centro, enquanto Braga est&aacute; a 39 km da propriedade. O aeroporto mais pr&oacute;ximo &eacute; o Aeroporto Francisco S&aacute; Carneiro, a 4 km do apartamento, que organiza um servi&ccedil;o de transfer do aeroporto, por um custo adicional.
og:image_links:  ['https://visitmaia.pt/assets/images/attractions/64c0e2e4d6864.png', 'https://visitmaia.pt/assets/images/attractions/64c0e2f830619.png', 'https://visitmaia.pt/assets/images/attractions/1677850811image2.jpeg', 'ht
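
The description printed above still contains HTML entities such as `&oacute;`. A minimal sketch of decoding them with the standard library's `html.unescape` before any further processing follows; `clean_description` is a hypothetical helper name, and `raw` is a short excerpt copied from the output above.

```python
import html


def clean_description(text):
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    return " ".join(html.unescape(text).split())


raw = "Situada na Maia, a 20 km da P&oacute;voa de Varzim"
print(clean_description(raw))
```

Note that collapsing whitespace also removes the line breaks in the description; drop the `split`/`join` step if those should be kept.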

In [2]:
import pandas as pd
from datetime import datetime

# Create a timestamp
timestamp = datetime.now()

# Create a dictionary to store the structured data
structured_data = {
    "url": url,
    "title": og_title,
    "description": og_description,
    "image_links": image_links,
    "timestamp": timestamp
}

# Create a DataFrame from the dictionary
df = pd.DataFrame([structured_data])


In [3]:
df

| | url | title | description | image_links | timestamp |
|---|---|---|---|---|---|
| 0 | https://visitmaia.pt/produtos-turisticos/maia-... | Maia Centro | Situada na Maia, a 20 km da P&oacute;voa de Va... | [https://visitmaia.pt/assets/images/attraction... | 2024-02-06 15:26:06.936534 |


In [4]:
import spacy

# Load the pre-trained Portuguese pipeline (including NER) from spaCy
nlp = spacy.load("pt_core_news_sm")

# Take the scraped og:description from the DataFrame
og_description = df["description"].iloc[0]
# Process the description text using spaCy
doc = nlp(og_description)

# Extract entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

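# Filter entities related to distance.
# Note: pt_core_news_sm only predicts LOC, MISC, ORG and PER, so this
# QUANTITY filter returns an empty list for this model.
distance_entities = [(text, label) for text, label in entities if label == 'QUANTITY']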

# Print the distance entities
for entity, label in distance_entities:
    print(label + ':', entity)


In [5]:
entities

[('Maia', 'LOC'),
 ('Varzim', 'LOC'),
 ('Maia Centro', 'LOC'),
 ('Wi-Fi', 'MISC'),
 ('pequeno-almo&ccedil;o continentais', 'LOC'),
 ('manh&atilde;s', 'PER'),
 ('Disp&otilde;e', 'LOC'),
 ('Eacute', 'MISC'),
 ('Porto', 'LOC'),
 ('Maia Centro', 'LOC'),
 ('Braga', 'LOC'),
 ('Aeroporto Francisco', 'LOC'),
 ('Carneiro', 'LOC')]
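
The entities above carry only LOC, MISC and PER labels, so the QUANTITY filter finds no distances. One alternative is a plain regular expression over the description text. This is a rough sketch that swaps NER for pattern matching: `extract_distances` and `DISTANCE_RE` are hypothetical names, the pattern only covers values written as a number followed by `km` or `m`, and `sample` is an excerpt of the description with its HTML entities already decoded.

```python
import re

# A number (optionally with a decimal comma or point) followed by km or m
DISTANCE_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(km|m)\b", re.IGNORECASE)


def extract_distances(text):
    """Return (value, unit) pairs for every distance-like mention in text."""
    return [
        (float(value.replace(",", ".")), unit.lower())
        for value, unit in DISTANCE_RE.findall(text)
    ]


sample = (
    "Situada na Maia, a 20 km da Póvoa de Varzim. "
    "O Porto fica a 10 km da Maia Centro, enquanto Braga está a 39 km da propriedade."
)
print(extract_distances(sample))
```

Pairing each distance with the nearest LOC entity would be a natural next step, but that matching logic is left out here.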