Introduction
============

In this project, I explore what drives business success, with a focus on consumer preferences and behavior. Using an extensive dataset from Yelp, a business directory used worldwide, the study aims to identify factors that correlate with business success. Yelp's data records how consumers rate and engage with businesses, which offers insight into what drives consumer decisions and, in turn, business success.

The dataset comes from Yelp and includes, for each business, an ID, a name, location details (address, city, state, postal code, latitude, longitude), operational aspects (attributes, hours), and consumer-generated metrics (stars, review count). I define 'business success' as a composite metric built from Yelp's star rating and the number of reviews, reflecting both customer satisfaction and engagement. I chose state, latitude, longitude, attributes, and hours as predictors based on the hypothesis that these factors significantly influence consumer decisions and, through them, business success.

The core research question is: **"What are the key factors that correlate with business success?"**

In [1]:
# Import packages and data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# lines=True reads the file as one JSON object per line.
df_b = pd.read_json("yelp_academic_dataset_business.json", lines=True)

In [2]:
has_na = df_b.isna().any().any()

print(f"Are there any missing values in the DataFrame? {has_na}")

Are there any missing values in the DataFrame? True


In [3]:
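# Drop every row that has at least one missing value.
# .copy() makes df_b_c an independent DataFrame, so new columns can be assigned to it safely.
df_b_c = df_b.dropna().copy()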
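
Before dropping rows, it can help to see which columns hold the missing values and how many rows `dropna()` would keep. The sketch below is a minimal illustration on a small hypothetical DataFrame (`toy`, with made-up businesses), not on the Yelp file, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: None marks a missing value.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "attributes": [{"WiFi": "u'free'"}, None, {"Caters": "True"}],
    "hours": [None, None, {"Monday": "9:0-17:0"}],
})

# Number of missing values in each column.
na_per_column = toy.isna().sum()

# Number of rows that survive dropna(), i.e. rows with no missing value at all.
rows_kept = len(toy.dropna())

print(na_per_column)
print(f"Rows kept: {rows_kept} of {len(toy)}")
```

Running the same two calls on `df_b` would show whether the missing values are concentrated in a few columns such as `attributes` or `hours`.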

In [7]:
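# Composite score: half the star rating (at most 2.5) plus the review-count percentile scaled to 0-2.5.
df_b_c.loc[:, 'business_success'] = df_b_c['stars'] * 0.5 + (100 - df_b_c['review_count'].rank(pct=True, ascending=False) * 100) / 40
# Display the two inputs next to the new score.
print(df_b_c[['stars', 'review_count', 'business_success']])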

        stars  review_count  business_success
1         3.0            15          2.607197
2         3.5            22          3.153569
3         4.0            80          4.122103
4         4.5            13          3.236456
5         2.0             6          1.270462
...       ...           ...               ...
150340    4.5            18          3.502338
150341    3.0            13          2.486456
150342    4.0             5          2.095702
150344    4.0            24          3.467027
150345    4.5             9          2.904874

[117618 rows x 3 columns]
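
To make the composite score easier to follow, here is a minimal sketch that applies the same formula to a small hypothetical DataFrame. The helper name `business_success` and the `toy` values are assumptions introduced for illustration. The formula itself is the one used in the cell above.

```python
import pandas as pd

def business_success(stars: pd.Series, review_count: pd.Series) -> pd.Series:
    # Descending percentile rank: the most-reviewed business gets the smallest value.
    # Ties share the average of their ranks (the pandas default).
    pct_from_top = review_count.rank(pct=True, ascending=False) * 100
    # Flip the rank so more reviews means a higher value, then scale to at most 2.5.
    engagement = (100 - pct_from_top) / 40
    # Star ratings run from 1 to 5, so this term lies between 0.5 and 2.5.
    satisfaction = stars * 0.5
    return satisfaction + engagement

# Hypothetical toy data, not taken from the Yelp file.
toy = pd.DataFrame({
    "stars": [5.0, 3.0, 1.0, 4.0],
    "review_count": [400, 50, 5, 50],
})
toy["business_success"] = business_success(toy["stars"], toy["review_count"])
print(toy)
```

In this toy example the 5-star business with 400 reviews scores 4.375, while the 1-star business with 5 reviews scores 0.5. The engagement term depends only on the rank of the review count, so it measures relative rather than absolute review volume.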


In [10]:
# describe() summarizes numeric columns only, so 'state' and 'hours' do not appear in the table.
print(f"summary statistics table:\n {df_b_c[['state','latitude','longitude','is_open','hours','business_success']].describe()}")

summary statistics table:
             latitude      longitude        is_open  business_success
count  117618.000000  117618.000000  117618.000000     117618.000000
mean       36.612308     -89.277679       0.807495          3.079650
std         5.838800      14.804658       0.394269          0.858535
min        27.555127    -120.095137       0.000000          0.595702
25%        32.173332     -90.349720       1.000000          2.502542
50%        38.731374     -86.120175       1.000000          3.082789
75%        39.953499     -75.449811       1.000000          3.753261
max        53.651838     -73.200457       1.000000          4.993028
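
The categorical `state` column is absent from the table above. A minimal sketch of one way to summarize such a column alongside the numeric ones, using a small hypothetical DataFrame (`toy`, with made-up values):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Yelp file.
toy = pd.DataFrame({
    "state": ["PA", "PA", "FL", "TN"],
    "latitude": [39.95, 40.00, 27.90, 36.10],
    "business_success": [3.1, 2.5, 4.0, 3.6],
})

# By default, describe() on a mixed DataFrame keeps only the numeric columns.
numeric_summary = toy.describe()

# value_counts() gives the number of businesses per state.
state_counts = toy["state"].value_counts()

print(list(numeric_summary.columns))
print(state_counts)
```

The same `value_counts()` call on `df_b_c['state']` would show how the businesses are distributed across states.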


In [11]:
# Define a list of attributes that are considered attractive.
attractive_attributes = ['AcceptsInsurance', 'AgesAllowed', 'Alcohol', 'Ambience', 'BYOB', 'BYOBCorkage', 'BestNights', 'BikeParking', 'BusinessAcceptsBitcoin', 'BusinessAcceptsCreditCards', 'BusinessParking', 'ByAppointmentOnly', 'Caters', 'CoatCheck', 'Corkage', 'DietaryRestrictions', 'DogsAllowed', 'DriveThru', 'GoodForDancing', 'GoodForKids', 'GoodForMeal', 'HairSpecializesIn', 'HappyHour', 'HasTV', 'Music', 'NoiseLevel', 'Open24Hours', 'OutdoorSeating', 'RestaurantsAttire', 'RestaurantsCounterService', 'RestaurantsDelivery', 'RestaurantsGoodForGroups', 'RestaurantsPriceRange2', 'RestaurantsReservations', 'RestaurantsTableService', 'RestaurantsTakeOut', 'Smoking', 'WheelchairAccessible', 'WiFi']

# Count the attractive attributes a business offers.
# `row` is the business's `attributes` dict; only values stored as the string 'True' count.
def count_attractive_attributes(row):
    return sum(1 for attr in attractive_attributes if row.get(attr) == 'True')

# Apply the function to each business's attributes dict.
df_b_c['attribute_point'] = df_b_c['attributes'].apply(count_attractive_attributes)

# df_b_c now has the new 'attribute_point' column.
print(df_b_c[['attribute_point']])


        attribute_point
1                     1
2                     3
3                     3
4                     5
5                     9
...                 ...
150340                4
150341                0
150342                2
150344                2
150345                2

[117618 rows x 1 columns]


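
To show what the counting function does with a single business, here is a minimal sketch on a hypothetical `sample` attributes dict. The shortened attribute list and the sample values are assumptions for illustration. The function body matches the one used above.

```python
# Shortened, hypothetical subset of the attribute list used above.
attractive_attributes = ['BikeParking', 'WiFi', 'RestaurantsTakeOut', 'Caters']

def count_attractive_attributes(row):
    # Count the listed attributes whose value is exactly the string 'True'.
    return sum(1 for attr in attractive_attributes if row.get(attr) == 'True')

# Hypothetical attributes dict for one business.
sample = {
    'BikeParking': 'True',
    'WiFi': "u'free'",            # not the string 'True', so it is not counted
    'RestaurantsTakeOut': 'True',
    'Caters': 'False',
}

points = count_attractive_attributes(sample)
print(points)  # 2
```

Because the comparison is against the string 'True', any attribute stored with another value contributes nothing to `attribute_point`, even if the business does offer it.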
