Reviews of Categories

Find the top business categories based on the total number of reviews. Output the category along with the total number of reviews. Order by total reviews in descending order.

In [1]:
import pandas as pd

In [2]:
yelp_business = pd.read_csv("../CSV/yelp_business.csv")
yelp_business

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,G5ERFWvPfHy7IDAUYlWL2A,All Colors Mobile Bumper Repair,,7137 N 28th Ave,Phoenix,AZ,85051,33.448,-112.074,1.0,4,1,Auto Detailing;Automotive
1,0jDvRJS-z9zdMgOUXgr6rA,Sunfare,,811 W Deer Valley Rd,Phoenix,AZ,85027,33.683,-112.085,5.0,27,1,Personal Chefs;Food;Gluten-Free;Food Delivery ...
2,6HmDqeNNZtHMK0t2glF_gg,Dry Clean Vegas,Southeast,"2550 Windmill Ln, Ste 100",Las Vegas,NV,89123,36.042,-115.118,1.0,4,1,Dry Cleaning & Laundry;Laundry Services;Local ...
3,pbt3SBcEmxCfZPdnmU9tNA,The Cuyahoga Room,,740 Munroe Falls Ave,Cuyahoga Falls,OH,44221,41.140,-81.472,1.0,3,0,Wedding Planning;Caterers;Event Planning & Ser...
4,CX8pfLn7Bk9o2-8yDMp_2w,The UPS Store,,"4815 E Carefree Hwy, Ste 108",Cave Creek,AZ,85331,33.798,-111.977,1.5,5,1,Notaries;Printing Services;Local Services;Ship...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,l4xrBZAKLXpSR4iprqTw8A,Mark's,,"1015 Lakeshore Boulevard E, Unit 5",Toronto,ON,M4M 1B3,43.656,-79.332,5.0,3,1,Women's Clothing;Shopping;Fashion;Men's Clothing
96,ICdzSGuv70gpSk7aqpIrHw,Wok-In Bbq,,"3540 Rutherford Road, Unit 67",Vaughan,ON,L4H 3T8,43.829,-79.549,4.5,10,1,Chinese;Barbeque;Restaurants
97,wk3wGDfJb1V-ciZpyhoNAA,Bic's Pub and Grill,,560 State Road 130,Trafford,PA,15085,40.390,-79.727,4.5,7,1,Pubs;American (Traditional);Nightlife;Bars;Piz...
98,NBYN4Nks_EsPHyAlJ_mdNw,Bistro Merlot,,18425 Antoine Faucon,Pierrefonds,QC,H9K 1M7,45.452,-73.887,4.5,8,0,Salad;Pizza;Restaurants;Event Planning & Servi...


In [3]:
result = yelp_business[['review_count', 'categories']].set_index('review_count').apply(lambda x: x.str.split(';').explode()).reset_index()
result

Unnamed: 0,review_count,categories
0,4,Auto Detailing
1,4,Automotive
2,27,Personal Chefs
3,27,Food
4,27,Gluten-Free
...,...,...
398,5,Property Management
399,5,Condominiums
400,5,Apartments
401,5,Home Services


In [4]:
result = result.groupby(['categories'])['review_count'].sum().to_frame('total_reviews').reset_index().sort_values('total_reviews', ascending=False)
result

Unnamed: 0,categories,total_reviews
141,Restaurants,1703
62,Food,508
133,Pizza,456
33,Chinese,417
95,Japanese,350
...,...,...
132,Pilates,3
64,Food Stands,3
154,Sporting Goods,3
41,Curry Sausage,3


Solution Walkthrough
This notebook finds the top business categories by total number of reviews and outputs each category together with its total. It uses the pandas library to reshape and aggregate the data.

Understanding The Data
The data is loaded from yelp_business.csv into a dataframe called yelp_business. Only two of its columns are needed: 'review_count' and 'categories'. Each 'categories' value is a single string of category names separated by semicolons (;), so one business can belong to several categories. To total reviews per category, that string has to be split and exploded so that every category gets its own row.
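To make the split-and-explode idea concrete, here is a minimal sketch on a two-row sample. The sample values are made up for illustration and are not taken from yelp_business.csv.

```python
import pandas as pd

# Hypothetical two-row sample; the values are made up for illustration.
sample = pd.DataFrame(
    {
        "review_count": [10, 4],
        "categories": ["Pizza;Restaurants", "Restaurants"],
    }
)

# Split each semicolon-separated string into a list of category names.
split_categories = sample["categories"].str.split(";")

# Give every list element its own row; the original row label is repeated.
exploded = split_categories.explode()

print(exploded.tolist())        # ['Pizza', 'Restaurants', 'Restaurants']
print(exploded.index.tolist())  # [0, 0, 1]
```

The repeated row labels (0, 0, 1) are what let each exploded category be matched back to the business it came from.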

The Problem Statement
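For each business category, the code sums the review counts of all businesses listed under that category, then sorts the categories by that total in descending order. A business listed under several categories contributes its full review count to each of them. The output contains the category name and the total number of reviews.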

Breaking Down The Code
Let's break down the code into smaller steps:

The first line of code imports the pandas library and assigns it the alias 'pd'.
import pandas as pd
The next line of code selects the 'review_count' and 'categories' columns from the yelp_business dataframe. The [['review_count', 'categories']] syntax selects specific columns from the dataframe.
result = yelp_business[["review_count", "categories"]]
The set_index('review_count') method moves the 'review_count' column into the index. Because explode() repeats the index label for every row it creates, each category produced in the next step keeps the review count of its business.
result = result.set_index("review_count")
The apply(lambda x: x.str.split(';').explode()) call does two things to the 'categories' column. First, str.split(';') turns each string into a list of category names. Then explode() gives every list element its own row, repeating the 'review_count' index value for each one.
result = result.apply(lambda x: x.str.split(";").explode())
After exploding the data, reset_index() restores the default integer index and turns 'review_count' from the index back into a regular column.
result = result.reset_index()
The groupby(['categories'])['review_count'].sum() function groups the data by the 'categories' column and calculates the sum of the 'review_count' column for each category.
result = result.groupby(["categories"])["review_count"].sum()
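The to_frame('total_reviews') method converts the resulting Series into a dataframe with a single column named 'total_reviews'. The reset_index() call that follows moves 'categories' from the index back into a regular column.
result = result.to_frame("total_reviews").reset_index()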
Finally, the dataframe is sorted in descending order based on the 'total_reviews' column using the .sort_values('total_reviews', ascending=False) function.
result = result.sort_values("total_reviews", ascending=False)
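The aggregation steps can be checked in isolation on a small, already exploded frame. This is a sketch with made-up values, not data from the notebook; only the method chain matches the code above.

```python
import pandas as pd

# Hypothetical exploded data: one row per (business, category) pair.
exploded = pd.DataFrame(
    {
        "review_count": [10, 10, 4, 7],
        "categories": ["Pizza", "Restaurants", "Restaurants", "Food"],
    }
)

totals = (
    exploded.groupby(["categories"])["review_count"]
    .sum()                      # Series indexed by category
    .to_frame("total_reviews")  # one-column dataframe, category still the index
    .reset_index()              # category becomes a regular column again
    .sort_values("total_reviews", ascending=False)
)

print(totals["categories"].tolist())     # ['Restaurants', 'Pizza', 'Food']
print(totals["total_reviews"].tolist())  # [14, 10, 7]
print(totals.index.tolist())             # [2, 1, 0]
```

Because reset_index() runs before sort_values(), the row labels keep their pre-sort positions. That is why the labels in the notebook output (141, 62, 133, ...) appear out of order.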
Bringing It All Together
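import pandas as pd

# Load the dataset, as in the notebook's second cell.
yelp_business = pd.read_csv("../CSV/yelp_business.csv")
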
result = (
    yelp_business[["review_count", "categories"]]
    .set_index("review_count")
    .apply(lambda x: x.str.split(";").explode())
    .reset_index()
)
result = (
    result.groupby(["categories"])["review_count"]
    .sum()
    .to_frame("total_reviews")
    .reset_index()
    .sort_values("total_reviews", ascending=False)
)
The first chain selects the 'review_count' and 'categories' columns, sets 'review_count' as the index, splits each 'categories' string on ';', explodes the resulting lists into one row per category, and resets the index. The second chain groups the rows by category, sums the review counts, names the result 'total_reviews', turns 'categories' back into a regular column, and sorts by total reviews in descending order.
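As an alternative to the set_index and apply combination, the same result can probably be reached with DataFrame.explode, which explodes a named column and repeats the other columns alongside it. The sketch below is one possible variant, not the notebook's method. The helper name total_reviews_by_category and the sample rows are hypothetical.

```python
import pandas as pd


def total_reviews_by_category(businesses: pd.DataFrame) -> pd.DataFrame:
    """Sum review_count per category (hypothetical helper, not from the notebook)."""
    exploded = (
        businesses[["review_count", "categories"]]
        # Replace each semicolon-separated string with a list of categories.
        .assign(categories=lambda df: df["categories"].str.split(";"))
        # One row per list element; review_count is repeated alongside it.
        .explode("categories")
    )
    return (
        exploded.groupby("categories", as_index=False)["review_count"]
        .sum()
        .rename(columns={"review_count": "total_reviews"})
        .sort_values("total_reviews", ascending=False)
        .reset_index(drop=True)
    )


# Made-up sample rows for illustration only.
sample = pd.DataFrame(
    {
        "review_count": [10, 4, 7],
        "categories": ["Pizza;Restaurants", "Restaurants", "Food;Pizza"],
    }
)
print(total_reviews_by_category(sample))
```

Here 'review_count' stays a regular column throughout, so no index juggling is needed, and the final reset_index(drop=True) renumbers the rows in sorted order. Note that groupby drops rows with a missing 'categories' value by default.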

Conclusion
The code finds the top business categories by total number of reviews using a short sequence of pandas operations: split, explode, group, sum, and sort. The resulting dataframe lists each category with its total number of reviews, sorted from most to least reviewed.