# ETL Yelp Reviews (Extract, Transform, Load)

This notebook prepares Yelp reviews for analysis. The project examines the opinions users post on the Yelp platform to surface insights related to **Ulta Beauty**.

Yelp, as a leading business review platform, is a substantial data source that helps us understand user preferences, highlight relevant aspects, and support informed decision-making.

Throughout this notebook, we walk through the Extract, Transform, Load (ETL) process to prepare the data. Subsequently, we will apply sentiment analysis and visualization techniques to extract meaningful insights.

### Requirements

⚠️ **Ensure the following libraries are available before running the code**

- json (Python standard library)
- pandas
- polars
- re (Python standard library)

`json` and `re` ship with Python and need no installation. You can install the other two by opening a terminal or command line window and running the following command:

*`pip install pandas polars`*

## 1. Import Libraries

In [22]:
import json
import pandas as pd
import polars as pl
import re

## 2. Connect and Upload Data

In [2]:
dfreviewsYelp=pl.read_ndjson(r"D:\Datasets_proyecto\review.json").to_pandas()
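
Because the file is loaded with `read_ndjson`, it is presumably newline-delimited JSON: one JSON object per line rather than a single JSON array. The sketch below illustrates that layout using only the standard library. The two records and their field values are made up for illustration and are not taken from `review.json`.

```python
import io
import json

# Two made-up records in NDJSON form: one JSON object per line.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5.0, "text": "Great service"}\n'
    '{"business_id": "b2", "stars": 1.0, "text": "Long wait"}\n'
)

# Each non-empty line is parsed independently of the others.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records), records[0]["stars"])
```

Parsing line by line like this is what makes the format convenient for large files, since no single line depends on the rest of the file.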

In [3]:
# We also read the dfbusinessYelp dataset because the reviews dataset does not include the business name

dfbusinessYelp=pd.read_csv(r'D:\Datasets_proyecto\dfbusinessYelp.csv')

## 3. Explore and Clean Data

In [4]:
business_ids = dfbusinessYelp['business_id'].unique()

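# Keep only the reviews whose business_id appears in dfbusinessYelp.
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.

dfreviewsYelp = dfreviewsYelp[dfreviewsYelp['business_id'].isin(business_ids)].copy()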
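
As a minimal sketch of the filter above, the toy example below keeps only the reviews whose `business_id` appears in a second frame. The frames `reviews` and `businesses` and all of their values are hypothetical stand-ins, not the project data.

```python
import pandas as pd

# Hypothetical stand-ins for the reviews and business tables.
reviews = pd.DataFrame(
    {"business_id": ["a", "b", "c", "a"], "stars": [5.0, 1.0, 3.0, 4.0]}
)
businesses = pd.DataFrame(
    {"business_id": ["a", "c"], "name": ["Ulta Beauty", "Ulta Beauty"]}
)

# isin() builds a boolean mask: True where the review's business_id is known.
keep = businesses["business_id"].unique()
filtered = reviews[reviews["business_id"].isin(keep)].copy()

print(filtered["business_id"].tolist())
```

The row for business `b` is dropped because it has no match in `businesses`; the original row labels of the surviving rows are preserved.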

In [14]:
# From the 'date' column, extract the month, the year, and the time of day (stored in the 'hour' column).

dfreviewsYelp['date'] = pd.to_datetime(dfreviewsYelp['date'], format='%Y-%m-%d %H:%M:%S')

dfreviewsYelp['month'] = dfreviewsYelp['date'].dt.month
dfreviewsYelp['year'] = dfreviewsYelp['date'].dt.year
dfreviewsYelp['hour'] = dfreviewsYelp['date'].dt.time
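
Note that `.dt.time` returns the full time of day (hours, minutes, seconds), while `.dt.hour` returns only the integer hour. The sketch below contrasts the two on made-up timestamps. The column name `hour_of_day` is an assumption introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Made-up timestamps in the same format as the 'date' column.
dates = pd.to_datetime(
    pd.Series(["2019-01-05 03:41:11", "2020-10-17 14:54:36"]),
    format="%Y-%m-%d %H:%M:%S",
)

parts = pd.DataFrame(
    {
        "month": dates.dt.month,
        "year": dates.dt.year,
        "time": dates.dt.time,         # datetime.time objects, e.g. 03:41:11
        "hour_of_day": dates.dt.hour,  # integers 0-23 (hypothetical column)
    }
)

print(parts)
```

If later analysis groups reviews by hour, an integer column like `hour_of_day` may be easier to aggregate than `datetime.time` objects.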

In [16]:
# Filter the dataset by year.
# The project scope covers 2019 through 2021, using the source provided by Henry.

dfreviewsYelp = dfreviewsYelp[(dfreviewsYelp['year'] >= 2019) & (dfreviewsYelp['year'] <= 2021)]
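
An equivalent way to express the same inclusive range is `Series.between`, which includes both endpoints by default. The sketch below uses a made-up `year` column and is offered as an alternative spelling, not as the notebook's method.

```python
import pandas as pd

# Made-up years that straddle the 2019-2021 project scope.
df = pd.DataFrame({"year": [2018, 2019, 2020, 2021, 2022]})

# between() is inclusive on both ends by default,
# so this matches (year >= 2019) & (year <= 2021).
in_scope = df[df["year"].between(2019, 2021)]

print(in_scope["year"].tolist())
```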

**The following columns are dropped because they are not relevant to the project**

In [None]:
# review_id, useful, funny, cool
dfreviewsYelp.drop(columns=['review_id', 'useful', 'funny', 'cool'], inplace=True)

# date (no longer needed once month, year, and hour have been extracted)
dfreviewsYelp.drop(columns=['date'], inplace=True)

**Handling null and duplicate values**

In [19]:
dfreviewsYelp.isnull().sum()

user_id        0
business_id    0
stars          0
text           0
month          0
year           0
hour           0
dtype: int64

No null values were found in the dataset.

In [20]:
# Get length before removing duplicates
longitud_antes = len(dfreviewsYelp)

# Remove duplicates and update the DataFrame
dfreviewsYelp = dfreviewsYelp.drop_duplicates()

# Get length after removing duplicates
longitud_despues = len(dfreviewsYelp)

# Calculate how many rows were deleted
filas_borradas = longitud_antes - longitud_despues

# Show the number of rows deleted
print(f"Removed {filas_borradas} duplicate rows.")

Removed 0 duplicate rows.
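
The same count can be obtained without comparing lengths: `DataFrame.duplicated()` flags every row that repeats an earlier one, so summing the flags gives the number of rows `drop_duplicates()` would remove. A small sketch on made-up data:

```python
import pandas as pd

# Made-up frame in which the second row repeats the first.
df = pd.DataFrame(
    {"user_id": ["u1", "u1", "u2"], "text": ["good", "good", "bad"]}
)

# duplicated() marks repeats of an earlier row (the first occurrence is kept).
n_dupes = int(df.duplicated().sum())
deduped = df.drop_duplicates()

print(n_dupes, len(deduped))
```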


**Column and Row Normalization**

In [23]:
# In the 'text' column, everything is converted to lowercase and special characters are removed

def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r'[^a-z0-9\s]', '', texto)
    return texto

dfreviewsYelp['text'] = dfreviewsYelp['text'].apply(limpiar_texto)
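
To make the effect of the pattern concrete, the sketch below applies the same two steps to a made-up sentence. `clean_text` is a hypothetical English-named copy of `limpiar_texto`, written here only for illustration. Note that the pattern also removes apostrophes and non-ASCII letters, and that it does not collapse the extra whitespace left behind.

```python
import re


def clean_text(text):
    """Hypothetical copy of limpiar_texto: lowercase, then strip symbols."""
    text = text.lower()
    # Keep only ASCII letters, digits, and whitespace.
    return re.sub(r"[^a-z0-9\s]", "", text)


cleaned = clean_text("I don't know... Café was GREAT!!")
spaced = clean_text("good - bad")

print(cleaned)  # i dont know caf was great
print(repr(spaced))  # 'good  bad' (two spaces remain)
```

Whether losing apostrophes and accented letters matters depends on the downstream sentiment tooling, so this is worth checking before the analysis step.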

In [25]:
# The 'source' column is added as an identifier.
# Y = data that comes from the Yelp dataset

dfreviewsYelp['source']='Y'

## 4. Final Structure

In [26]:
dfreviewsYelp.head()

| | user_id | business_id | stars | text | month | year | hour | source |
|---|---|---|---|---|---|---|---|---|
| 325357 | vGMsqtn5CovrNzJZWfyC1w | idP674ti6a8yg8z2xFcCgA | 1.0 | third and final trip to this location visited... | 1 | 2019 | 03:41:11 | Y |
| 362966 | jyc88KFa8QiFTojcTpwwRA | Vsx34Z-N5S5S0o0f2G6ORw | 5.0 | amina was so helpful and very friendly i found... | 5 | 2019 | 19:00:13 | Y |
| 366648 | ymquu8umi3hXsKnYP0JfbQ | fWMPbickerGWohPy2vDL5A | 2.0 | i used to like coming here but the last handfu... | 10 | 2019 | 14:54:36 | Y |
| 384938 | jtRjv6VBDHdk81UoKuFOww | DJZQCN0NUej_EtviN4rUlg | 1.0 | given this store three chances now to treat me... | 3 | 2019 | 09:20:03 | Y |
| 390171 | dA7uBWRP6-NmOqPd1QqPwg | idP674ti6a8yg8z2xFcCgA | 1.0 | i dont know why i keep going in this store but... | 11 | 2019 | 12:59:25 | Y |
