# ETL Incremental Load of Google Data (Extract, Transform, Load)

This notebook retrieves Google Maps review data through the Apify API and preprocesses it. The collected data includes user IDs, place IDs, star ratings, review text, and timestamps. The objective is to clean and structure the data for further analysis.

## 1. Import Libraries

In [1]:
import pandas as pd
import requests
from datetime import datetime
import re

## 2. Connect and Load Data

In [None]:
# API URL with token
api_url = "https://api.apify.com/v2/actor-tasks/frombini~google-maps-scraper-task/runs/last/dataset/items?token=apify_api_ZE7FMykxbpsHef9FAMxUqF9esYgIGK2LrElm&unwind=reviews&fields=placeId,reviews&omit=textTranslated,publishAt,likesCount,reviewId,reviewUrl,reviewerUrl,reviewerPhotoUrl,reviewerNumberOfReviews,isLocalGuide,rating,reviewImageUrls,reviewContext,reviewDetailedRating,name,responseFromOwnerDate,responseFromOwnerText"

# Make a GET request (the timeout keeps the cell from hanging indefinitely)
response = requests.get(api_url, timeout=60)

# Stop here if the request failed, so `data` is never left undefined
if response.status_code != 200:
    raise RuntimeError(f"Error in the request: {response.status_code}, {response.text}")

# Parse the JSON response (expected to be a list with one item per review)
data = response.json()
print(f"Retrieved {len(data)} items")

# Create a new DataFrame
new_data = pd.DataFrame(data=data)
new_data.head()
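
The request above embeds the API token directly in the URL, which exposes it to anyone who can read the notebook. A minimal sketch of one alternative builds the same kind of URL from its parts and reads the token from an environment variable. The variable name `APIFY_TOKEN` and the helper `build_items_url` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import urlencode

# Base endpoint copied from the cell above (last run of the actor task).
BASE_URL = (
    "https://api.apify.com/v2/actor-tasks/"
    "frombini~google-maps-scraper-task/runs/last/dataset/items"
)

def build_items_url(token, unwind=None, fields=None, omit=None):
    """Assemble the dataset-items URL from a token and optional query options."""
    params = {"token": token}
    if unwind:
        params["unwind"] = unwind
    if fields:
        params["fields"] = ",".join(fields)
    if omit:
        params["omit"] = ",".join(omit)
    # safe="," keeps the comma-separated lists readable, as in the original URL
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

# APIFY_TOKEN is a hypothetical variable name; the fallback is only a placeholder.
token = os.environ.get("APIFY_TOKEN", "<your-token>")
example_url = build_items_url(token, unwind="reviews", fields=["placeId", "reviews"])
```

The resulting `example_url` could then be passed to `requests.get` in place of the hard-coded string.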

**Drop rows with missing text**

In [3]:
new_data.dropna(subset=['text'], inplace=True)

**Remove duplicate rows**

In [4]:
new_data.drop_duplicates(inplace=True)
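
To make the effect of these two steps concrete, here is a small sketch on made-up rows. The sample values are invented for illustration and do not come from the API.

```python
import pandas as pd

# Invented sample rows: one exact duplicate and one missing review text.
sample = pd.DataFrame({
    "reviewerId": ["u1", "u2", "u2", "u3"],
    "placeId": ["p1", "p1", "p1", "p2"],
    "text": ["Great food", "Too slow", "Too slow", None],
})

# Same two steps as above, written without inplace so each result is explicit.
cleaned = sample.dropna(subset=["text"]).drop_duplicates()

print(len(sample), "->", len(cleaned))  # 4 -> 2
```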

**Rename columns**

In [None]:
new_data = new_data.rename(columns={'reviewerId': 'user_id', 'placeId': 'place_id'})

**Convert the 'publishedAtDate' column to datetime type**

In [6]:
# Convert the 'publishedAtDate' column to datetime type
new_data['publishedAtDate'] = pd.to_datetime(new_data['publishedAtDate'])

# Extract month, year, and hour
new_data['month'] = new_data['publishedAtDate'].dt.month
new_data['year'] = new_data['publishedAtDate'].dt.year
new_data['hour'] = new_data['publishedAtDate'].dt.hour

new_data.drop(columns=['publishedAtDate'], inplace=True)
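
The conversion can be checked on a couple of invented timestamps. This sketch assumes the API returns ISO 8601 strings ending in `Z`, which the notebook does not confirm. It passes `utc=True` so the extracted hour is explicitly in UTC.

```python
import pandas as pd

# Invented timestamps; the ISO 8601 format is an assumption about the API output.
sample = pd.DataFrame({
    "publishedAtDate": ["2023-11-05T14:30:00.000Z", "2024-01-20T08:05:10.000Z"]
})

# utc=True makes the result timezone-aware, so the hour is explicitly in UTC.
sample["publishedAtDate"] = pd.to_datetime(sample["publishedAtDate"], utc=True)

# Same extraction as above.
sample["month"] = sample["publishedAtDate"].dt.month
sample["year"] = sample["publishedAtDate"].dt.year
sample["hour"] = sample["publishedAtDate"].dt.hour
```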

**Define column order**

In [7]:
column_order = ['user_id', 'place_id', 'stars', 'text', 'month', 'year', 'hour']
new_data = new_data[column_order]

**Convert the 'text' column to lowercase and remove special characters**

In [8]:
def clean_text(text):
    # Check if the value is a string
    if isinstance(text, str):
        # Convert to lowercase
        text = text.lower()
        # Remove special characters using regular expressions
        text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

new_data['text'] = new_data['text'].apply(clean_text)
new_data.head()
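
Note that the pattern in `clean_text` removes every character outside `a-z`, `0-9`, and whitespace, so accented letters are dropped entirely ("café" becomes "caf"). If reviews in languages other than English matter, one possible variant maps accented letters to their ASCII base before filtering. The function name `clean_text_keep_letters` is hypothetical, and this is a sketch rather than the notebook's method.

```python
import re
import unicodedata

def clean_text_keep_letters(text):
    """Variant of clean_text that maps accented letters to ASCII before filtering."""
    if isinstance(text, str):
        text = text.lower()
        # Decompose accented characters (e.g. "é" becomes "e" plus a combining mark)
        text = unicodedata.normalize("NFKD", text)
        # Drop everything that is not plain ASCII, including the combining marks
        text = text.encode("ascii", "ignore").decode("ascii")
        # Same filter as clean_text: keep lowercase letters, digits, and whitespace
        text = re.sub(r"[^a-z0-9\s]", "", text)
    return text

print(clean_text_keep_letters("¡Excelente café!"))  # excelente cafe
```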

**Change 'place_id' to 'business_id'**

In [None]:
# Read the 'businessId_gmapID' table
businessId_gmapID = pd.read_csv(r"D:\Datasets_proyecto\businessId_gmapID.csv")

# Replace the placeholder 'Sin Valores' ("no values") in 'place_id' with the value of 'business_id'
businessId_gmapID['place_id'] = businessId_gmapID['place_id'].mask(
    businessId_gmapID['place_id'] == 'Sin Valores', businessId_gmapID['business_id']
)

# Inner merge on the 'place_id' column
new_data = pd.merge(businessId_gmapID, new_data, on='place_id', how='inner')

# Drop the leftover CSV index column and the 'place_id' column
new_data.drop(columns=['Unnamed: 0', 'place_id'], inplace=True)
new_data.head()
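
Because the merge above is an inner join, reviews whose `place_id` has no entry in the lookup table are dropped without warning. A minimal sketch of how such rows could be inspected first, using invented data that mirrors the column names above:

```python
import pandas as pd

# Invented lookup table and reviews, mirroring the column names used above.
lookup = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "place_id": ["p1", "Sin Valores"],
})
reviews = pd.DataFrame({
    "place_id": ["p1", "p1", "p9"],
    "stars": [5, 3, 4],
})

# Equivalent to the replacement above: fall back to business_id for the placeholder.
lookup["place_id"] = lookup["place_id"].mask(
    lookup["place_id"] == "Sin Valores", lookup["business_id"]
)

# A left merge from the reviews side with indicator=True labels unmatched rows.
checked = reviews.merge(lookup, on="place_id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(unmatched))  # 1 (the review for "p9" has no business_id)
```

Counting `left_only` rows before switching to the inner join gives a quick check on how many reviews the lookup table fails to cover.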

**Yelp load code (Apify)**

In [None]:
# API URL with token
api_url = "https://api.apify.com/v2/actor-tasks/frombini~yelp-scrap-task/runs/last/dataset/items?token=apify_api_ZE7FMykxbpsHef9FAMxUqF9esYgIGK2LrElm&unwind=review"

# Make a GET request
response = requests.get(api_url)

# Check the response status
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()

    # Print the data
    print(data)
else:
    print(f"Error in the request: {response.status_code}, {response.text}")