# ETL Yelp Business (Extract, Transform, Load)

This project analyzes the insights and user opinions found in business reviews on Yelp. The Extract, Transform, Load (ETL) process is central to that work: we take the data for businesses listed on Yelp and apply ETL steps to collect, transform, and prepare it for analysis.

### Requirements

⚠️ **Make sure to install the following libraries before running the code**

- pandas
- requests
- geopy

You can install these libraries by opening a terminal or command line window and running the following command:

*`pip install pandas requests geopy`*

## 1. Import Libraries

In [1]:
import pandas as pd
import requests
import io
import re
from datetime import datetime
from geopy.geocoders import Nominatim

## 2. Connect and Load Data

In [2]:
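def cargar_dataset_pickle(file_url):
    """Download a pickled DataFrame from file_url; return None if the request fails."""
    response = requests.get(file_url, timeout=60)

    if response.status_code == 200:  # The request was successful
        # Only unpickle data from a trusted source: unpickling can execute arbitrary code
        return pd.read_pickle(io.BytesIO(response.content))

    print(f'Error getting file. Status code: {response.status_code}')
    return None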

url='https://drive.usercontent.google.com/download?id=1byFtzpZXopdCN-XYmMHMpZqzgAqfQBBu&export=download&authuser=0&confirm=t&uuid=189d48ef-ef64-4bc9-aec7-9e654cb4c757&at=APZUnTVmlezTicVw58BDGG9siM6Q:1704808512976'
dfbusinessYelp=cargar_dataset_pickle(url)
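
The loader above reads the downloaded bytes through an in-memory buffer instead of writing them to disk. The following is a minimal sketch of that step in isolation, using a small made-up DataFrame in place of the real download; the helper name `dataframe_from_pickle_bytes` and the sample values are assumptions for illustration only.

```python
import io
import pickle

import pandas as pd


def dataframe_from_pickle_bytes(content):
    """Rebuild a DataFrame from raw pickle bytes (hypothetical helper)."""
    # io.BytesIO wraps the bytes in a file-like object that read_pickle accepts
    return pd.read_pickle(io.BytesIO(content))


# Made-up data standing in for the body of the HTTP response
original = pd.DataFrame({'business_id': ['a1', 'b2'], 'stars': [3.5, 4.0]})
payload = pickle.dumps(original)

restored = dataframe_from_pickle_bytes(payload)
print(restored)
```

Because unpickling can run arbitrary code, this approach is only appropriate for files from a source you trust.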

## 3. Explore and Clean Data

In [3]:
# Duplicate columns are removed from the dataframe
dfbusinessYelp = dfbusinessYelp.loc[:, ~dfbusinessYelp.columns.duplicated()]

# The project scope is limited to Ulta Beauty, so only those businesses are kept
dfbusinessYelp = dfbusinessYelp[dfbusinessYelp['name'] == 'Ulta Beauty']
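
To see what the two steps in this cell do, here is a small sketch on a made-up DataFrame with a repeated `business_id` column. The rows and the name `Other Shop` are invented for illustration and are not taken from the Yelp dataset.

```python
import pandas as pd

# Made-up rows; the third column repeats the name 'business_id'
toy = pd.DataFrame(
    [['a1', 'Ulta Beauty', 'a1'],
     ['b2', 'Other Shop', 'b2'],
     ['c3', 'Ulta Beauty', 'c3']],
    columns=['business_id', 'name', 'business_id'],
)

# columns.duplicated() flags every repeat of a column name after its first occurrence
deduped = toy.loc[:, ~toy.columns.duplicated()]

# Keep only the rows for the business in scope
ulta = deduped[deduped['name'] == 'Ulta Beauty']
print(ulta)
```

Removing the duplicated column first matters here: with two columns named `name`, the expression `df['name']` would return a DataFrame rather than a Series, and the comparison would no longer produce a simple row mask.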

**The following columns are eliminated because they are not relevant to the project**

In [4]:
# is_open
dfbusinessYelp = dfbusinessYelp.drop('is_open', axis=1)

# address
dfbusinessYelp = dfbusinessYelp.drop('address', axis=1)  # Latitude and longitude are used instead of the address

# name
dfbusinessYelp = dfbusinessYelp.drop('name', axis=1)  # All stores share the name "Ulta Beauty"; business_id is the identifier

# categories
dfbusinessYelp = dfbusinessYelp.drop('categories', axis=1)  # All Ulta Beauty categories correspond to the beauty industry

# attributes
dfbusinessYelp = dfbusinessYelp.drop('attributes', axis=1)

# hours
dfbusinessYelp = dfbusinessYelp.drop('hours', axis=1)
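
The same result can be reached with a single `drop` call that takes a list of column names. This is a sketch of that alternative, not the notebook's own code; the helper `drop_irrelevant` and the toy data are assumptions, and `errors='ignore'` is added so the cell can be re-run without failing on columns that are already gone.

```python
import pandas as pd

COLUMNS_TO_DROP = ['is_open', 'address', 'name', 'categories', 'attributes', 'hours']


def drop_irrelevant(df, columns=COLUMNS_TO_DROP):
    """Return a copy of df without the listed columns (hypothetical helper)."""
    # errors='ignore' skips names that are not present instead of raising KeyError
    return df.drop(columns=columns, errors='ignore')


# Made-up row with a subset of the real columns
toy = pd.DataFrame({
    'business_id': ['a1'],
    'name': ['Ulta Beauty'],
    'is_open': [1],
    'stars': [3.5],
})
trimmed = drop_irrelevant(toy)
print(list(trimmed.columns))
```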

**Handling null values**

In [9]:
round(dfbusinessYelp.isnull().sum() / dfbusinessYelp.shape[0] * 100, 2).astype(str) + ' %'

business_id     0.0 %
city            0.0 %
state           0.0 %
postal_code     0.0 %
latitude        0.0 %
longitude       0.0 %
stars           0.0 %
review_count    0.0 %
dtype: object

No null values were found in the dataset.
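
The percentage of nulls per column can also be computed as the mean of the boolean null mask. The sketch below shows this on made-up data that does contain a missing value, so the non-zero case is visible; the helper name `null_percentages` is an assumption.

```python
import pandas as pd


def null_percentages(df):
    """Percentage of missing values per column, rounded to 2 decimals (hypothetical helper)."""
    # isna() yields True/False per cell; the mean of a boolean column is the share of True values
    return df.isna().mean().mul(100).round(2)


# Made-up data: one of the four cities is missing
toy = pd.DataFrame({
    'city': ['newark', None, 'plainfield', 'philadelphia'],
    'stars': [3.5, 3.0, 3.5, 2.5],
})
pct = null_percentages(toy)
print(pct)
```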

**Column and Row Normalization**

In [10]:
# Convert the column names to snake_case format

def snake_case(column_name):
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', column_name).lower()


dfbusinessYelp.columns = dfbusinessYelp.columns.map(snake_case)
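
The regular expression in `snake_case` inserts an underscore wherever a lowercase letter is directly followed by an uppercase letter, then lowercases the result. The sketch below runs it on a few sample names; the inputs other than `business_id` are invented for illustration.

```python
import re


def snake_case(column_name):
    # Insert '_' at every lowercase-to-uppercase boundary, then lowercase everything
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', column_name).lower()


# Hypothetical column names, plus one that is already snake_case
examples = ['postalCode', 'reviewCount', 'business_id', 'HTTPResponse']
converted = [snake_case(name) for name in examples]
print(converted)
```

Names that are already in snake_case pass through unchanged. A run of capitals such as `HTTPResponse` has no lowercase-to-uppercase boundary, so it is only lowercased and not split.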

In [11]:
# The values in the 'city' column are converted to snake_case

dfbusinessYelp['city'] = dfbusinessYelp['city'].str.lower().str.replace(' ', '_')
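
If city names could carry stray or repeated spaces, a slightly more defensive variant would trim and collapse whitespace before replacing it. This is a sketch under that assumption; nothing in the dataset shown here confirms such values exist, and `normalize_city` is a hypothetical helper.

```python
import pandas as pd


def normalize_city(cities):
    """Lowercase city names and join words with single underscores (hypothetical helper)."""
    return (
        cities.str.strip()                            # remove leading/trailing whitespace
              .str.lower()
              .str.replace(r'\s+', '_', regex=True)   # any run of whitespace becomes one '_'
    )


# Made-up inputs with irregular spacing
sample = pd.Series([' Fairview  Heights ', 'Newark'])
normalized = normalize_city(sample)
print(normalized.tolist())
```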

In [12]:
# The state abbreviations in the 'state' column are replaced by the full state names

estado_mapping = {
    'PA': 'pennsylvania',
    'FL': 'florida',
    'NV': 'nevada',
    'LA': 'louisiana',
    'AZ': 'arizona',
    'IN': 'indiana',
    'TN': 'tennessee',
    'MO': 'missouri',
    'CA': 'california',
    'ID': 'idaho',
    'NJ': 'new_jersey',
    'DE': 'delaware',
    'IL': 'illinois',
    'AB': 'AB'  # Kept as a placeholder; resolved later from latitude and longitude
}

# Apply the mapping

dfbusinessYelp['state'] = dfbusinessYelp['state'].map(estado_mapping)
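
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The sketch below shows how to list such codes before they are lost. The shortened mapping and the code `TX` are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative mapping; 'TX' is deliberately missing
state_names = {'PA': 'pennsylvania', 'FL': 'florida'}
codes = pd.Series(['PA', 'FL', 'TX'])

mapped = codes.map(state_names)

# Codes whose mapped value is NaN were not covered by the dictionary
unmapped = sorted(codes[mapped.isna()].unique())
print(unmapped)
```

Running a check like this before overwriting the `state` column makes it easy to extend the mapping when a new abbreviation appears.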

In [13]:
# Businesses with state 'AB' are incorrectly identified, so latitude and longitude are used to determine the state they belong to.

def obtener_estado(latitud, longitud):
    geolocator = Nominatim(user_agent="my_geocoder")
    location = geolocator.reverse((latitud, longitud), language="en")

    if location is not None:
        # The state is usually found under the "address" key of the response
        estado = location.raw.get('address', {}).get('state', None)
        return estado
    else:
        return None

sindato = 0  # Number of coordinate pairs for which no state could be resolved
faltantes = dfbusinessYelp[dfbusinessYelp.state == 'AB'][['latitude', 'longitude']].drop_duplicates()
for a in range(len(faltantes)):
    latitud = faltantes.iloc[a, 0]
    longitud = faltantes.iloc[a, 1]

    estado = obtener_estado(latitud, longitud)

    if estado:
        dfbusinessYelp.loc[(dfbusinessYelp.latitude == latitud) & (dfbusinessYelp.longitude == longitud), 'state'] = estado
    else:
        sindato += 1
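
The geocoder returns state names in their usual spelling (the final table contains `Illinois`, for example), while the mapped values are lowercase snake_case. If a uniform format is wanted, the resolved name can be normalized before it is written back. The sketch below assumes that goal and takes the lookup function as a parameter, so it can be tried with a stub instead of a live Nominatim request; `fill_states`, `to_snake`, `fake_lookup`, and the coordinates are all hypothetical.

```python
import pandas as pd


def to_snake(name):
    """Lowercase a state name and join its words with underscores."""
    return name.strip().lower().replace(' ', '_')


def fill_states(df, reverse_state, placeholder='AB'):
    """Replace placeholder states using reverse_state(lat, lon) -> name or None.

    Returns the updated copy and the number of coordinate pairs left unresolved.
    """
    out = df.copy()
    pending = out.loc[out['state'] == placeholder, ['latitude', 'longitude']].drop_duplicates()
    unresolved = 0
    for lat, lon in pending.itertuples(index=False, name=None):
        state = reverse_state(lat, lon)
        if state:
            mask = (out['latitude'] == lat) & (out['longitude'] == lon)
            out.loc[mask, 'state'] = to_snake(state)
        else:
            unresolved += 1
    return out, unresolved


# Stub standing in for the geocoder; the coordinates and answers are made up
def fake_lookup(lat, lon):
    return {(38.5, -89.9): 'Illinois'}.get((lat, lon))


toy = pd.DataFrame({
    'latitude': [38.5, 0.0, 27.9],
    'longitude': [-89.9, 0.0, -82.4],
    'state': ['AB', 'AB', 'florida'],
})
result, unresolved = fill_states(toy, fake_lookup)
print(result['state'].tolist(), unresolved)
```

In the notebook, `obtener_estado` could be passed as `reverse_state`. Creating the `Nominatim` object once outside the function, and pausing between requests, would also be gentler on the public geocoding service.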

In [14]:
# The 'source' column is added as an identifier.
# Y = Data that comes from the yelp dataset

dfbusinessYelp['source']='Y'

## 4. Final Structure

In [15]:
dfbusinessYelp.head()

| | business_id | city | state | postal_code | latitude | longitude | stars | review_count | source |
|---|---|---|---|---|---|---|---|---|---|
| 883 | 4uqRhXZTOzKF2ZhxbWzxfA | newark | pennsylvania | 19702 | 39.672058 | -75.6489 | 3.5 | 11 | Y |
| 1063 | fWMPbickerGWohPy2vDL5A | plainfield | arizona | 46168 | 39.713441 | -86.357947 | 3.0 | 14 | Y |
| 5488 | DJZQCN0NUej_EtviN4rUlg | philadelphia | pennsylvania | 19131 | 39.978981 | -75.27146 | 3.5 | 12 | Y |
| 13384 | Vxqa8u_5RD5e7oBqdaU0yQ | fairview_heights | Illinois | 62208 | 38.596645 | -89.987348 | 3.5 | 13 | Y |
| 13760 | idP674ti6a8yg8z2xFcCgA | newtown_square | arizona | 19073 | 39.987189 | -75.403201 | 2.5 | 24 | Y |
