<p style="text-align: center; font-size: 30px;">
    <strong>
          Scraper for Amazon
    </strong>
</p>

----

<b style="text-align: left; font-size: 18px;">
    This project scrapes publicly available user reviews of products listed on amazon.es
</b>

----

<b style="font-size: 20px;">Project Roadmap</b>

-  <p style="font-size: 18px;">Create a list of target URLs</p>
-  <p style="font-size: 18px;">Develop a program that automatically opens the target pages and extracts their data</p>
-  <p style="font-size: 18px;">Save the acquired data in a DataFrame for future applications</p>
---

<b style="font-size: 20px;">Required tools</b>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    <a href="https://www.python.org/downloads/" target="_blank" rel="noopener noreferrer">Latest version of Python</a>
</blockquote>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    <a href="https://googlechromelabs.github.io/chrome-for-testing/" target="_blank" rel="noopener noreferrer">Latest version of ChromeDriver (the WebDriver for Chrome)</a>
</blockquote>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    Selenium library for scraping purposes
</blockquote>

```python
!pip install selenium
```

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    Pandas library for data storage and manipulation
</blockquote>

```python
!pip install pandas
```

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    tqdm library for progress tracking
</blockquote>

```python
!pip install tqdm
```

----
<b style="font-size: 20px;">Specification of the target URLs</b>

<p style="font-size: 18px;">The same method can be applied to other goods on amazon.es</p>

```python
base_url = 'https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews'

# base_url[155] is the page number ("1") in "pageNumber=1";
# split the URL around it once, outside the loop
prefix = base_url[:155]
suffix = base_url[156:]

# Build one URL per review page (pages 1 to 318)
urls = []
for i in range(1, 319):
    urls.append(f"{prefix}{i}{suffix}")
```
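
<p style="font-size: 18px;">The slice indices used above depend on the exact length of this one URL. As a rough sketch of a less brittle alternative, the page number could be set through the query string with the standard library's <code>urllib.parse</code>. The helper name <code>with_page_number</code> and the constant <code>BASE_URL</code> are hypothetical and not part of the original notebook.</p>

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same address as base_url above, split over several lines for readability
BASE_URL = (
    "https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica"
    "/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2"
    "?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews"
)


def with_page_number(url: str, page: int) -> str:
    """Return `url` with its `pageNumber` query parameter set to `page`."""
    parts = urlsplit(url)
    # Parse the query string into an ordered mapping and overwrite one key
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["pageNumber"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# One URL per review page, pages 1 to 318 as in the loop above
urls = [with_page_number(BASE_URL, page) for page in range(1, 319)]
```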
----
<b style="font-size: 20px;">Automatic interaction with web services</b>

<p style="font-size: 18px;">WebDriver is a tool that provides a programmatic interface for interacting with web browsers</p>

```python
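# DRIVER_PATH (path to the chromedriver binary) and options (a Chrome Options
# object) are defined in the final code below
service = Service(DRIVER_PATH)
driver = webdriver.Chrome(service=service, options=options)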
```
----

<b style="font-size: 20px;">Necessary adjustments for the bot</b>

<p style="font-size: 18px;">To avoid restrictions, the bot must emulate the behaviour of a human user</p>

<p style="font-size: 18px;">This setting makes the bot send a user-agent string describing the browser, just as a regular browser would</p>

```python
# options is the Chrome Options object later passed to webdriver.Chrome
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
```

<p style="font-size: 18px;">Pausing for a random interval between page loads also helps to avoid detection</p>

```python
time.sleep(random.randint(1, 3))
```
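
<p style="font-size: 18px;">Note that <code>random.randint(1, 3)</code> only ever yields whole pauses of 1, 2 or 3 seconds. As a minimal sketch, assuming fractional delays are acceptable, <code>random.uniform</code> gives a less regular pattern. The helper name <code>polite_pause</code> and its <code>sleep</code> parameter are hypothetical; inside the scraping loop, a call to <code>polite_pause()</code> would take the place of the line above.</p>

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random, possibly fractional, number of seconds.

    `sleep` is injectable so the helper can be exercised without waiting.
    Returns the delay that was used.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```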
----
<p style="text-align: center; font-size: 25px;">
    <strong>
          The final code
    </strong>
</p>

---

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd

import time
import random
from tqdm import tqdm

def get_urls():
    """Build the list of review-page URLs (pages 1 to 318)."""
    base_url = 'https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews'

    # base_url[155] is the page number ("1"); keep everything around it
    prefix = base_url[:155]
    suffix = base_url[156:]

    list_urls = []
    for i in range(1, 319):
        # Move to the next page by swapping in its number
        list_urls.append(f"{prefix}{i}{suffix}")

    return list_urls

def scrape_reviews(urls):
    """Open each URL in headless Chrome and collect the reviews on the page."""
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

    DRIVER_PATH = '/Users/innafilatova/Desktop/project_cafetera/chromedriver'

    service = Service(DRIVER_PATH)
    driver = webdriver.Chrome(service=service, options=options)

    reviews = []

    try:
        for url in tqdm(urls, desc='Processing URLs', unit='URL'):
            driver.get(url)

            # Pause for a random 1 to 3 seconds between page loads
            time.sleep(random.randint(1, 3))

            # Extract the reviews
            review_blocks = driver.find_elements(By.CSS_SELECTOR, '.a-section.review')
            for review_block in review_blocks:
                title = review_block.find_element(By.CSS_SELECTOR, '.review-title-content').text.strip()
                rating = review_block.find_element(By.CSS_SELECTOR, '.a-icon-alt').get_attribute('textContent').strip()
                body = review_block.find_element(By.CSS_SELECTOR, '[data-hook="review-body"]').text.strip()
                author = review_block.find_element(By.CSS_SELECTOR, '.a-profile-name').text.strip()
                date = review_block.find_element(By.CSS_SELECTOR, '.review-date').text.strip()

                reviews.append({
                    'title': title,
                    'rating': rating,
                    'body': body,
                    'author': author,
                    'date': date,
                })
    finally:
        # Close the browser even if a page fails
        driver.quit()

    print("Done.")
    return reviews

# Save the collected data
def save_to_csv(reviews, filename='amazon_reviews.csv'):
    """Write the list of review dicts to a CSV file."""
    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False, encoding='utf-8')

if __name__ == "__main__":

    urls = get_urls()
    reviews = scrape_reviews(urls)
    save_to_csv(reviews, 'amazon_reviews.csv')
    print(f'Saved {len(reviews)} reviews in amazon_reviews.csv')
```

Processing URLs: 100%|███████████████████████| 318/318 [11:59<00:00,  2.26s/URL]

Done.
Saved 10 reviews in amazon_reviews.csv
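
<p style="font-size: 18px;">The extraction loop assumes that every review block contains all five elements. Selenium's <code>find_element</code> raises <code>NoSuchElementException</code> when nothing matches, whereas <code>find_elements</code> returns an empty list. The sketch below, with the hypothetical helper name <code>first_text</code>, shows one way to tolerate a missing field; it is an illustration under that assumption, not the notebook's own method.</p>

```python
def first_text(block, selector, by="css selector", default=""):
    """Return the stripped text of the first match inside `block`.

    `block` is any object with a Selenium-style find_elements(by, value)
    method, such as a WebElement. "css selector" is the string value of
    selenium.webdriver.common.by.By.CSS_SELECTOR. When nothing matches,
    `default` is returned instead of raising an exception.
    """
    matches = block.find_elements(by, selector)
    return matches[0].text.strip() if matches else default
```

<p style="font-size: 18px;">With such a helper, a line like <code>title = first_text(review_block, '.review-title-content')</code> would yield an empty string for a review without a title. The rating would still need <code>get_attribute('textContent')</code>, since the final code reads it from that attribute rather than from <code>.text</code>.</p>
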





---
<b style="font-size: 20px;">Saving data in a DataFrame</b>


```python
# Load the saved data into a DataFrame
data = pd.read_csv('amazon_reviews.csv')

data.head()
```

|   | title | rating | body | author | date |
|---|---|---|---|---|---|
| 0 | preciosa | 5,0 de 5 estrellas | a mi marido le encanta como sale el cafe | Noelia | Revisado en España el 12 de junio de 2024 |
| 1 | Buena cafetera pésimas instrucciones | 4,0 de 5 estrellas | Llegó con el depósito del agua roto, me puse e... | Pedro Maria Alonso Villaro | Revisado en España el 31 de marzo de 2023 |
| 2 | Muy linda y buena pero fragil | 4,0 de 5 estrellas | La cafetera es muy linda, el estilo es muy chu... | Alexis Vaquero | Revisado en España el 23 de enero de 2024 |
| 3 | Buena cafetera | 4,0 de 5 estrellas | Después de un mes usándola aprovecho para pone... | An | Revisado en España el 6 de febrero de 2022 |
| 4 | Hace un cafè muy Bueno . | 4,0 de 5 estrellas | Si la temperatura es demasiado alta cosa que s... | Xavier B. | Revisado en España el 27 de noviembre de 2023 |
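
<p style="font-size: 18px;">The <code>rating</code> and <code>date</code> columns are stored as Spanish text. As a minimal sketch, assuming every row follows the two patterns visible in the sample above, they could be converted to a number and a calendar date with the standard library. The names <code>parse_rating</code>, <code>parse_review_date</code> and <code>SPANISH_MONTHS</code> are hypothetical.</p>

```python
import re
from datetime import date

# Spanish month names as they appear in the scraped dates
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}


def parse_rating(text):
    """Turn a string such as '4,0 de 5 estrellas' into 4.0 (None if no match)."""
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)\s+de\s+5", text)
    return float(match.group(1).replace(",", ".")) if match else None


def parse_review_date(text):
    """Extract the date from 'Revisado en España el 12 de junio de 2024'."""
    match = re.search(r"(\d{1,2}) de (\w+) de (\d{4})", text)
    if not match:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name.lower())
    # Unknown month names give None rather than a guessed date
    return date(int(year), month, int(day)) if month else None
```

<p style="font-size: 18px;">Applied to the DataFrame, this might look like <code>data['rating'].map(parse_rating)</code> and <code>data['date'].map(parse_review_date)</code>.</p>
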


Text cleaning