## User

- user_id: A unique 22-character identifier string that maps to the user in user.json.
- name: A string representing the user's first name.
- review_count: An integer denoting the total number of reviews the user has written.
- yelping_since: A string indicating the date the user joined Yelp, formatted as YYYY-MM-DD.
- friends: An array of strings, each representing a unique user_id of the user's friends.
- useful: An integer representing the total number of useful votes the user has received.
- funny: An integer denoting the number of funny votes received by the user.
- cool: An integer showing the number of cool votes the user has received.
- fans: An integer indicating the total number of fans the user has.
- elite: An array of integers, each representing a year the user was considered elite on Yelp.
- average_stars: A float representing the average rating across all of the user's reviews.
- compliment_hot: An integer denoting the number of hot compliments the user has received.
- compliment_more: An integer showing the number of more compliments received by the user.
- compliment_profile: An integer indicating the number of profile compliments received.
- compliment_cute: An integer representing the number of cute compliments the user has received.
- compliment_list: An integer denoting the number of list compliments received by the user.
- compliment_note: An integer indicating the number of note compliments received.
- compliment_plain: An integer showing the number of plain compliments received by the user.
- compliment_cool: An integer denoting the number of cool compliments received.
- compliment_funny: An integer indicating the number of funny compliments received.
- compliment_writer: An integer representing the number of writer compliments received by the user.
- compliment_photos: An integer showing the number of photo compliments received.

## Business

- business_id: A unique 22-character identifier string for the business.
- name: A string representing the business's name.
- address: A string indicating the full address of the business.
- city: A string denoting the city where the business is located.
- state: A 2-character string representing the state code, if applicable.
- postal_code: A string representing the postal code.
- latitude: A float indicating the latitude of the business location.
- longitude: A float showing the longitude of the business location.
- stars: A float for the star rating, rounded to half-stars.
- review_count: An integer indicating the number of reviews.
- is_open: An integer (0 or 1) indicating whether the business is closed or open, respectively.
- attributes: An object mapping business attributes to values, noting that some attribute values might be objects. This includes attributes like RestaurantsTakeOut (boolean) and BusinessParking (an object detailing parking options).
- categories: An array of strings representing the business categories.
- hours: An object mapping days of the week to the business's operating hours, using a 24-hour clock.

## Review

- review_id: A unique 22-character identifier string for the review.
- user_id: A unique 22-character identifier string that maps to the user in user.json.
- business_id: A unique 22-character identifier string that maps to the business in business.json.
- stars: An integer indicating the star rating given in the review.
- date: A string representing the date the review was posted, formatted as YYYY-MM-DD.
- text: A string containing the text of the review itself, detailing the user's experience.
- useful: An integer representing the number of useful votes the review has received.
- funny: An integer indicating the number of funny votes received by the review.
- cool: An integer showing the number of cool votes received by the review.


### Proposed Context
#### User
- **Yelping Since:** *Interval*. Dates can be measured relative to each other, but there's no true zero point.
- **Review Count & Average Stars**: *Ratio*. These have a true zero point (no reviews, zero stars) and can be compared meaningfully in terms of ratio.
- **Friends**: *Nominal*. The IDs of friends are categorical and used for identification.
- **Elite Status**: *Ordinal*. Years of elite status could be ranked, but the intervals between them are not necessarily meaningful.
- **Compliments**: *Ratio*. The counts of different types of compliments have a true zero and can be compared in ratios.
#### Business
- **Location (Latitude & Longitude)**: *Interval*. While these are numerical and can be measured relative to each other, the zero point is arbitrary.
- **Attributes & Categories**: *Nominal*. These are categorical data describing various qualities or types of businesses.
- **Operating Hours**: *Nominal/Interval*. The days of the week are nominal, but the hours can be considered interval data since they have a meaningful order and difference.
  We can either average the hours or keep them per day, for example one column per day holding that day's hours. Open questions: can interval data be transformed into numbers, or should we simply bin it into categories?
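
One way to turn the interval-valued hours into numbers is to store, per day, how many hours the business is open. The sketch below assumes each day's value is a string of the form `H:M-H:M` on a 24-hour clock (for example `9:0-17:30`) and that a closing time at or before the opening time means the business closes after midnight. The helper name `open_duration_hours` and the example record are hypothetical.

```python
def open_duration_hours(hours_range):
    """Return the length of an 'H:M-H:M' opening interval in hours."""
    start, end = hours_range.split("-")

    def to_hours(clock):
        hour, minute = clock.split(":")
        return int(hour) + int(minute) / 60

    duration = to_hours(end) - to_hours(start)
    # Assumption: a non-positive duration means closing after midnight.
    if duration <= 0:
        duration += 24
    return duration


# Hypothetical example record, one entry per day as in the `hours` object.
hours = {"Monday": "9:0-17:30", "Friday": "18:0-2:0"}
hours_per_day = {day: open_duration_hours(span) for day, span in hours.items()}
```

Each day then becomes one numeric column, which could later be binned into categories if a categorical context is preferred.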
#### Review
- **Date of Review**: *Interval*. Similar to yelping_since, dates can be measured relative to each other, without a true zero. We can extract:
  - isWeekday/isWeekend: whether the date falls on a weekday or a weekend
  - isHoliday: whether the date is a holiday
  - season: which season it is: spring, summer, autumn, or winter

  For contexts with only two options we can keep a single column. For example, if a day is not a weekday it must be a weekend and vice versa, so one of the two options can be dropped.
- **Text Sentiment**: *Ordinal*. Sentiment scores, if applied, typically range from negative to positive, implying an order, but the intervals between scores might not be uniform.
- **Votes (Useful, Funny, Cool)**: *Ratio*. These are counts of the number of times a review was found useful, funny, or cool, with a true zero and meaningful ratios.
  Could this serve as a measure of a place's relevance?
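
As a rough sketch of keeping one column per two-option context, the snippet below derives a single 0/1 `is_weekend` column from a date column, so a separate weekday column is unnecessary. The tiny DataFrame and its dates are made up for illustration.

```python
import pandas as pd

# Made-up review dates: a Friday, a Saturday, and a Sunday.
reviews = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"])}
)

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
reviews["is_weekend"] = (reviews["date"].dt.dayofweek >= 5).astype(int)
```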

In [None]:
import pandas as pd 
from datetime import datetime
import numpy as np 
context = ['year', 'month', 'weekday', 'week number', 'longitude', 'latitude', 'season', 'isHoliday', 'isWeekend']
len(context)

In [None]:
df_review = pd.read_csv('../data/raw/YELP/yelp_academic_dataset_review.csv')
df_user = pd.read_csv('../data/raw/YELP/yelp_academic_dataset_user.csv')
df_business = pd.read_csv('../data/raw/YELP/yelp_academic_dataset_business.csv')


In [None]:
df_user.head(1)

In [None]:
df_review.head(1)

In [None]:
df_business.head(1)

#### Unique Categories in the Dataset

In [None]:
business_categories = set()
for row in df_business.categories:
  if pd.isna(row): 
    continue
  for element in row.split(','):
    business_categories.add(element.strip().lower())

len(business_categories)

#### Unique Attributes

In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict
import ast  

business_attributes = defaultdict(int)

for row in df_business.attributes:
    if not pd.isna(row):
        try:
            attributes_dict = ast.literal_eval(row)
            for element in attributes_dict.keys():
                business_attributes[element] += 1
        except (ValueError, SyntaxError):
            print(f"Skipping row, unable to convert to dictionary: {row}")
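
The loop above counts top-level attribute keys only. Since some attribute values (such as BusinessParking) are themselves objects, they may need to be expanded as well. The sketch below assumes that, in the CSV, nested values are stored as strings containing Python-style dict literals; the helper `flatten_attributes` and the example row are hypothetical.

```python
import ast


def flatten_attributes(raw):
    """Parse an attributes string and expand nested dict-like values."""
    flat = {}
    for key, value in ast.literal_eval(raw).items():
        # Nested values may themselves be stored as strings; try to parse them.
        if isinstance(value, str):
            try:
                value = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # keep plain strings as they are
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical row in the format assumed above.
row = "{'RestaurantsTakeOut': 'True', 'BusinessParking': \"{'garage': False, 'street': True}\"}"
flat = flatten_attributes(row)
```

Joining the parent and child keys with a dot is only one possible naming choice for the resulting columns.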



#### Data Transformation

In [None]:
# Context of the user
user_cols = ["user_id", "elite", "yelping_since", "friends"]
# Location
business_col = ["business_id", "latitude", "longitude", 'state']
# Review
review_cols = ["user_id", "business_id", "date", 'stars']
# Merge 
merged_df = pd.merge(
    df_review[review_cols], df_user[user_cols], on="user_id", how="inner", suffixes=["_review", "_user"]
)
# Merge 
data = pd.merge(merged_df, df_business[business_col], on='business_id', how='inner')
# Transform to date 
data['date'] = pd.to_datetime(data['date'])
data['yelping_since'] = pd.to_datetime(data['yelping_since'])

#### Classify into workday/weekend

In [None]:
def classify_day(day):
    if day >= 5:
        return 'weekend'
    else:
        return 'workday'

data['isweekend'] = data['date'].dt.dayofweek.apply(classify_day)


#### Classify by time of day

In [None]:
def classify_time_of_day(time):
    if 5 <= time.hour < 7:
        return 'sunrise'
    elif 7 <= time.hour < 12:
        return 'morning'
    elif 12 <= time.hour < 14:
        return 'noon'
    elif 14 <= time.hour < 17:
        return 'afternoon'
    elif 17 <= time.hour < 20:
        return 'evening'
    else:
        return 'night'

# Note: if `date` carries no time component, the hour is 0 and every row is labelled 'night'.
data['daytime'] = data['date'].apply(classify_time_of_day)


#### Transform date into calendar number

In [None]:
data['week_number'] = data['date'].dt.isocalendar().week
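
ISO week numbers follow the ISO 8601 calendar, so the first days of January can belong to week 52 or 53 of the previous ISO year. The small standard-library check below, independent of the dataset, illustrates this; it may matter if `week_number` is later combined with the calendar year.

```python
from datetime import date

# 1 January 2021 is a Friday that still belongs to ISO week 53 of 2020.
iso_year, iso_week, iso_weekday = date(2021, 1, 1).isocalendar()

# The first ISO week of 2021 starts on Monday, 4 January.
next_year, next_week, _ = date(2021, 1, 4).isocalendar()
```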

#### Transform the date into a holiday flag

In [None]:
import holidays
# Returns "isholiday" or "notholiday" for a row with a date and a state code
def is_holiday(row):
    ca_provinces = ['AB', 'BC', 'MB', 'NB', 'NL', 'NS', 'NT', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT']
    date, state = row['date'], row['state']
    if state in ca_provinces:
        holiday_list = holidays.Canada(prov=state)
    else:  # For US states and other unspecified codes
        holiday_list = holidays.UnitedStates(state=state if state in holidays.US.subdivisions else None)
    
    # Return "isholiday" or "notholiday" based on whether the date is in the holiday list
    return "isholiday" if date in holiday_list else "notholiday"

In [None]:
data['holiday_status'] = data.apply(is_holiday, axis=1)

#### Transform friends and elite years into counts

In [None]:
data['num_elite'] = data.elite.apply(lambda x: len(x.split(',')) if not pd.isna(x) else 0)
data['num_firends'] = data.friends.apply(lambda x: len(x.split(',')) if not pd.isna(x) else 0)
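
Splitting on commas counts one entry even for an empty string. As a hedged sketch, the helper below treats missing values, empty strings, and the literal string `None` as zero entries. Whether those placeholders actually occur in the CSV is an assumption to check against the data, and `count_items` is a hypothetical name.

```python
import math


def count_items(value):
    """Count comma-separated entries, treating missing values as zero."""
    # NaN (how pandas represents a missing CSV cell) and None count as zero.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    items = [item.strip() for item in str(value).split(",")]
    # Assumption: empty strings and the literal "None" are placeholders.
    return sum(1 for item in items if item and item != "None")
```

It could be applied in place of the lambdas, for example with `data.elite.apply(count_items)`.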

#### Transform date into season

In [None]:
def get_season(date):
    # Ensure the input is a datetime object
    if not isinstance(date, datetime):
        raise ValueError("The date must be a datetime object")
    
    spring_start = datetime(date.year, 3, 20)
    summer_start = datetime(date.year, 6, 21)
    fall_start = datetime(date.year, 9, 22)
    winter_start = datetime(date.year, 12, 21)

    if date >= spring_start and date < summer_start:
        return 'spring'
    elif date >= summer_start and date < fall_start:
        return 'summer'
    elif date >= fall_start and date < winter_start:
        return 'fall'
    else:
        return 'winter'

data['season'] = data['date'].apply(get_season)
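
The same classification can be written more compactly by comparing `(month, day)` tuples, which avoids building four datetime objects per row. This is a sketch using the same fixed boundary dates as above, which only approximate the Northern Hemisphere equinoxes and solstices; `season_of` is a hypothetical name.

```python
from datetime import datetime


def season_of(date):
    """Map a date to a season using fixed approximate boundary dates."""
    key = (date.month, date.day)
    if (3, 20) <= key < (6, 21):
        return "spring"
    if (6, 21) <= key < (9, 22):
        return "summer"
    if (9, 22) <= key < (12, 21):
        return "fall"
    return "winter"


example = season_of(datetime(2020, 7, 4))
```

Because it only reads the `month` and `day` attributes, the function also accepts pandas timestamps.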

#### Get seniority of a user in the app

In [None]:
max_date = data.date.max().year
data['seniority'] = data['yelping_since'].apply(lambda x: max_date - x.year )

#### Transform ids to integers

In [None]:
data['userId'] = pd.factorize(data['user_id'])[0]
data['businessId'] = pd.factorize(data['business_id'])[0]
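
`pd.factorize` returns both the integer codes and the unique values, so keeping the second element allows mapping the integers back to the original ids later. A small illustration with made-up identifiers:

```python
import pandas as pd

# Made-up identifiers standing in for the 22-character Yelp ids.
ids = pd.Series(["u_b", "u_a", "u_b", "u_c"])

codes, uniques = pd.factorize(ids)
# codes holds one integer per row; uniques[codes] recovers the original ids.
recovered = uniques[codes]
```

Codes are assigned in order of first appearance, so they depend on the row order of the DataFrame.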

#### Filter desired columns

In [None]:
data.columns

In [None]:
columns = [
    "userId",
    "businessId",
    "stars",
    "latitude",
    "longitude",
    "isweekend",
    "daytime",
    "week_number",
    "season",
    "holiday_status",
    "num_firends",
    "num_elite",
    'seniority',
]

In [None]:
data[columns]

In [None]:
import pandas as pd
import json
import sys 


file_path = "../data/raw/YELP/yelp_academic_dataset_business.json"

# Read the line-delimited JSON file one record at a time.
records = []

with open(file_path, "r", encoding="utf-8") as file:
    for i, line in enumerate(file):
        if i % 1000 == 0:
            print(i)  # progress indicator
        records.append(json.loads(line))

df = pd.DataFrame(records)

In [None]:
df.to_csv('../data/raw/YELP/yelp_academic_dataset_business.csv', index=False)

In [None]:
business_categories = set()
for row in df.categories:
  if row is None:
    continue
  for element in row.split(','):
    business_categories.add(element.strip().lower())

len(business_categories)

In [None]:
from collections import defaultdict



business_attributes = defaultdict(int)
for row in df.attributes:
  if row is not None:
    for element in list(row.keys()):
      business_attributes[element] += 1
  
  

In [None]:
sorted_attributes = sorted(business_attributes.items(), key=lambda x: x[1], reverse=True)