# ETL Google Reviews (Extract, Transform, Load)

This notebook prepares Google Reviews data for analysis. The goal of the project is to analyze user-generated opinions on the Google platform and extract insights about the reviewed establishments.

Google is a prominent business review platform and therefore a substantial data source. This data helps us understand user preferences, highlight the relevant aspects of each business, and support well-informed decisions.

Throughout this notebook, we walk through the Extract, Transform, Load (ETL) process to prepare the data. The cleaned data is then ready for the sentiment analysis and visualization techniques used to derive insights from Google Reviews.

### Requirements

⚠️ **Ensure you have the following library installed before running the code**

- pandas

The `os`, `re`, and `datetime` modules used below are part of the Python standard library and do not need to be installed.

You can install pandas by opening a terminal or command line window and running the following command:

*`pip install pandas`*

## 1. Import Libraries

In [12]:
import pandas as pd
import os
from datetime import datetime
import re

## 2. Connect and Upload Data

In [2]:
directorio = "D:/Datasets_proyecto/reviews-estados"

# List to store the individual DataFrames
dfs = []

# Iterate over each file in the directory
for filename in os.listdir(directorio):
    if filename.endswith(".json"):
        filepath = os.path.join(directorio, filename)
        
        # Read the JSON file into a DataFrame
        df = pd.read_json(filepath, lines=True)
        
        # Add the DataFrame to the list
        dfs.append(df)

# Concatenate all DataFrames into one        
dfreviewsGoogle = pd.concat(dfs, ignore_index=True)
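
The loading loop above can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's original code: the function name `load_json_lines_dir` and the sample records are hypothetical, and it assumes every `.json` file in the folder is in JSON Lines format (one record per line), as `lines=True` implies.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_json_lines_dir(directory):
    """Read every *.json file (JSON Lines) in `directory` into one DataFrame."""
    # Sort the paths so the row order is reproducible between runs.
    files = sorted(Path(directory).glob("*.json"))
    if not files:
        # pd.concat fails on an empty list, so report the problem explicitly.
        raise FileNotFoundError(f"No .json files found in {directory}")
    frames = [pd.read_json(path, lines=True) for path in files]
    return pd.concat(frames, ignore_index=True)


# Demonstration with made-up records written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.json").write_text(
        '{"gmap_id": "a", "rating": 5}\n{"gmap_id": "b", "rating": 3}\n'
    )
    Path(tmp, "2.json").write_text('{"gmap_id": "c", "rating": 4}\n')
    demo_reviews = load_json_lines_dir(tmp)
```

Raising an error when no file matches makes a wrong path easier to spot than the less specific error `pd.concat` raises on an empty list.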

## 3. Explore and Clean Data

In [3]:
# Load the dfbusinessGoogle dataset: the reviews include only the business ID, not the name of the location

dfbusinessGoogle = pd.read_csv(r'D:\Datasets_proyecto\dfbusinessGoogle.csv')


# Get the unique business IDs from dfbusinessGoogle

business_ids_to_keep = dfbusinessGoogle['business_id'].unique()


# Keep only the reviews whose gmap_id appears in dfbusinessGoogle.
# .copy() avoids a SettingWithCopyWarning when new columns are added later.

dfreviewsGoogle = dfreviewsGoogle[dfreviewsGoogle['gmap_id'].isin(business_ids_to_keep)].copy()
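
To illustrate the filtering step, here is a small sketch on made-up data. The IDs and ratings are hypothetical; only the column names `business_id`, `gmap_id`, and `rating` come from the notebook.

```python
import pandas as pd

# Hypothetical business catalogue and reviews.
businesses = pd.DataFrame({"business_id": ["0x1:0xa", "0x2:0xb"]})
reviews = pd.DataFrame(
    {"gmap_id": ["0x1:0xa", "0x9:0xf", "0x2:0xb"], "rating": [5, 1, 4]}
)

# Keep only the reviews whose gmap_id is present in the catalogue.
ids_to_keep = businesses["business_id"].unique()
kept = reviews[reviews["gmap_id"].isin(ids_to_keep)].copy()

# Because `kept` is an explicit copy, adding a column does not
# trigger a SettingWithCopyWarning.
kept["source"] = "G"
```

The review for `0x9:0xf` is discarded because that ID has no matching business.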

In [5]:
# Extract the month, year, and hour from the 'time' column (Unix timestamp in milliseconds)

dfreviewsGoogle['time'] = pd.to_datetime(dfreviewsGoogle['time'], unit='ms')

dfreviewsGoogle['month'] = dfreviewsGoogle['time'].dt.month
dfreviewsGoogle['year'] = dfreviewsGoogle['time'].dt.year
dfreviewsGoogle['hour'] = dfreviewsGoogle['time'].dt.hour
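
The conversion above can be checked on two made-up timestamps. This is a sketch only; the values are hypothetical. Note that `pd.to_datetime` with `unit='ms'` interprets the numbers as milliseconds since the Unix epoch, so the resulting hour is in UTC rather than in the reviewer's local time.

```python
import pandas as pd

# Two hypothetical timestamps in epoch milliseconds.
demo = pd.DataFrame({"time": [1620057600000, 1577836800000]})

demo["time"] = pd.to_datetime(demo["time"], unit="ms")
demo["month"] = demo["time"].dt.month
demo["year"] = demo["time"].dt.year
demo["hour"] = demo["time"].dt.hour

print(demo)
# The first row is 2021-05-03 16:00 UTC, the second 2020-01-01 00:00 UTC.
```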

**The following columns are dropped because they are not relevant to the project**

In [4]:
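# Drop the 'name', 'pics', and 'resp' columns
dfreviewsGoogle.drop(columns=['name', 'pics', 'resp'], inplace=True)

# Drop the raw 'time' column now that month, year, and hour have been extracted
dfreviewsGoogle.drop(columns=['time'], inplace=True)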

**Columns and Rows Normalization**

In [7]:
# Rename the columns

dfreviewsGoogle = dfreviewsGoogle.rename(columns={'gmap_id': 'business_id', 'rating': 'stars'})

In [9]:
# Reorder the columns

column_order = ['user_id', 'business_id', 'stars', 'text', 'month', 'year', 'hour']
dfreviewsGoogle = dfreviewsGoogle[column_order]

In [15]:
# Convert the text column to lowercase and remove special characters

def limpiar_texto(texto):
    # Only strings are cleaned; other values (e.g. NaN) are returned unchanged
    if isinstance(texto, str):
        # Convert to lowercase
        texto = texto.lower()
        # Remove every character that is not a-z, a digit, or whitespace
        texto = re.sub(r'[^a-z0-9\s]', '', texto)
    return texto

dfreviewsGoogle['text'] = dfreviewsGoogle['text'].apply(limpiar_texto)
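
The effect of the cleaning function is easiest to see on a few sample strings. The sketch below restates it under the hypothetical English name `clean_text`; the sample reviews are made up. It shows that punctuation is removed, that accented letters are removed as well (which may matter if some reviews are not in English), and that non-string values pass through unchanged.

```python
import re


def clean_text(text):
    """Lowercase a string and keep only a-z, digits, and whitespace."""
    if isinstance(text, str):
        text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text


samples = ["Great place!!!", "Café & Bar", None]
cleaned = [clean_text(s) for s in samples]

print(cleaned)
# → ['great place', 'caf  bar', None]
```

Removing `é` and `&` leaves two consecutive spaces in the second result; collapsing repeated whitespace would be an additional step that the notebook does not perform.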

In [17]:
# Add the 'source' column as an identifier.
# G = data that comes from the Google dataset

dfreviewsGoogle['source']='G'

**Handling null and duplicate values**

In [20]:
# Delete rows where 'text' column is null

dfreviewsGoogle = dfreviewsGoogle.dropna(subset=['text'])


# Remove duplicate rows based on all columns

dfreviewsGoogle = dfreviewsGoogle.drop_duplicates()
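
A small sketch of this step on made-up rows follows. The last filter is an optional extra, not part of the original notebook: `dropna` removes only missing values, so a review that becomes an empty string after cleaning would otherwise remain.

```python
import pandas as pd

# Hypothetical reviews: one exact duplicate, one null text, one empty text.
demo = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u3"],
        "text": ["good", "good", None, ""],
    }
)

# Same two operations as in the notebook.
demo = demo.dropna(subset=["text"]).drop_duplicates()

# Optional extra (an assumption, not in the original pipeline):
# also discard texts that are empty or whitespace-only.
demo_non_empty = demo[demo["text"].str.strip() != ""]
```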

In [24]:
# Filter the dataset by year.
# The project scope covers 2019 through 2021, using the source provided by Henry.

dfreviewsGoogle = dfreviewsGoogle[dfreviewsGoogle['year'].between(2019, 2021)]

## 4. Final Structure

In [25]:
dfreviewsGoogle.head()

| | user_id | business_id | stars | text | month | year | hour | source |
|---|---|---|---|---|---|---|---|---|
| 94018 | 1.018389e+20 | 0x5342381c6ddf30e3:0xfbe922695b89d6de | 5 | go to this place specifically for my foundatio... | 5 | 2021 | 16 | G |
| 94019 | 1.113166e+20 | 0x5342381c6ddf30e3:0xfbe922695b89d6de | 5 | i recommend this place\nthey are so polite an... | 4 | 2021 | 16 | G |
| 94020 | 1.142388e+20 | 0x5342381c6ddf30e3:0xfbe922695b89d6de | 5 | love this place friendly staff and always will... | 2 | 2021 | 23 | G |
| 94021 | 1.07962e+20 | 0x5342381c6ddf30e3:0xfbe922695b89d6de | 5 | fantastic place for beauty queens kings | 2 | 2021 | 4 | G |
| 94022 | 1.05009e+20 | 0x5342381c6ddf30e3:0xfbe922695b89d6de | 5 | helpful staff and lots of variety | 5 | 2021 | 18 | G |
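
The `user_id` values in the preview appear in scientific notation, which suggests they were parsed as floating-point numbers. Floats cannot represent 21-digit identifiers exactly, so distinct users could end up with the same value. If exact IDs matter, one option is to ask `pd.read_json` to keep the column as text. The sketch below is an assumption about how this could be done, not part of the original pipeline; the records are made up and assume the IDs are stored as strings in the JSON files.

```python
import io

import pandas as pd

# Two hypothetical records whose IDs differ only in the last digit.
raw = (
    '{"user_id": "101838900000000000001", "rating": 5}\n'
    '{"user_id": "101838900000000000002", "rating": 4}\n'
)

# Request string dtype for user_id so the digits are preserved exactly.
as_text = pd.read_json(io.StringIO(raw), lines=True, dtype={"user_id": str})

print(as_text["user_id"].tolist())
```

In the notebook, the same `dtype` argument would be passed to the `pd.read_json(filepath, lines=True)` call in the loading loop.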
