# Report - Classifying Business Status

## 1. Introduction

### 1.1 Problem
Starting and running a successful business is a fulfilling yet incredibly challenging endeavor. According to the U.S. Bureau of Labor Statistics (BLS), roughly 45% of businesses fail within the first 5 years of their inception. Businesses fail for various reasons, such as lack of market knowledge, a bad location, insufficient financing, or poor service. With that in mind, I decided to create a classification model that predicts whether a restaurant will close or remain open, given information such as revenue, lifespan, and review ratings.

### 1.2 Target Audience
The model will be useful for lenders and investors in deciding whether to lend to or invest in a restaurant business, and it also gives aspiring restaurateurs an indication of whether their particular restaurant concept is likely to succeed. Overall, the model helps minimize financial loss for all parties involved.

### 1.3 Dataset
I am using [Yelp's dataset](https://www.yelp.com/dataset), which was retrieved in July 2020. The dataset covers 10 metropolitan areas and contains the following information:
- 8,021,122 reviews
- 209,393 businesses
- 200,000 pictures
- 1,320,761 tips
- 1,968,703 users
- Over 1.4 million business attributes.

<img src="img/dataset_img.png" alt="business" style="width: 100%;"/>

---

## 2. Data Wrangling
Prior to data wrangling, I used [json_to_csv_converter.py](https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py) to flatten the nested JSON objects within the business attributes column, which created 40 additional columns prefixed with ‘attributes’, for example attributes.Caters and attributes.RestaurantsAttire.
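To illustrate what flattening does, here is a minimal standard-library sketch. It is not the converter script itself; the `flatten` helper and the sample record are hypothetical and only show how a nested `attributes` object turns into dot-prefixed columns.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects such as `attributes`.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical record shaped like one line of the business JSON file.
business = {
    "business_id": "abc123",
    "name": "Example Diner",
    "attributes": {"Caters": "True", "RestaurantsAttire": "casual"},
}

flat_business = flatten(business)
print(sorted(flat_business))
```

Each nested key becomes its own column name, which is why the flattened business table gains columns such as attributes.Caters.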

The scope of this project is limited to ‘independent’ restaurants in the hospitality industry; therefore, the business dataset needed to be cleaned and transformed. The data cleaning steps for each dataset are summarized below.

### 2.1 Business Dataset
The business dataframe contained 209,393 businesses, mainly from the hospitality industry (restaurants, bars, food, etc.). However, a significant number of businesses were part of chains or were not food/drinks related (law firms, pet grooming, real estate, etc.), so those needed to be filtered out.
- Removed 52,954 businesses that were part of chain restaurants, reducing the dataset to 143,958 businesses (38% reduction)
- Removed non-food/drinks related businesses, reducing the dataset from 143,958 to 44,046 businesses (69.4% reduction)
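The report does not state how chains were identified, so the following is only a sketch under two assumptions: a business counts as a chain when its name appears more than once, and a business is food/drinks related when its comma-separated `categories` string contains one of a small keyword set. `FOOD_KEYWORDS` and `filter_independent_food` are hypothetical names.

```python
from collections import Counter

# Assumed keyword set; the categories kept in the actual project may differ.
FOOD_KEYWORDS = {"Restaurants", "Bars", "Food", "Nightlife", "Cafes"}


def filter_independent_food(businesses):
    """Keep food/drink businesses whose name occurs only once in the data."""
    name_counts = Counter(b["name"].strip().lower() for b in businesses)
    kept = []
    for b in businesses:
        # Assumption: a repeated name indicates a multi-unit (chain) business.
        is_chain = name_counts[b["name"].strip().lower()] > 1
        categories = {c.strip() for c in (b.get("categories") or "").split(",")}
        is_food = bool(categories & FOOD_KEYWORDS)
        if is_food and not is_chain:
            kept.append(b)
    return kept


sample = [
    {"name": "Burger Chain", "categories": "Restaurants, Burgers"},
    {"name": "Burger Chain", "categories": "Restaurants, Burgers"},
    {"name": "Corner Bistro", "categories": "Restaurants, French"},
    {"name": "Smith & Co Law", "categories": "Lawyers, Professional Services"},
]
independent = filter_independent_food(sample)
print([b["name"] for b in independent])
```

Name matching is a crude proxy: it can miss chains with location-specific names and can wrongly drop unrelated businesses that happen to share a name.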

##### Handling NaNs in Business Dataframe
Multiple columns related to business attributes (goodforkids, goodforgroups, tableservice, delivery, etc.) contained NaNs. Because these values were not explicitly defined in the initial dataset, I set them to 0, treating the business as not offering those services.

Missing values in the price column (16%) were replaced with the rounded average price (on the 1-4 scale) of the most similar businesses.
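The price imputation can be sketched as follows, using only the standard library. The feature vectors, the neighbour count `k`, and the function names are assumptions made for illustration; the report does not specify how many similar businesses were averaged.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def impute_price(target_features, known, k=3):
    """Average the prices of the k most similar businesses.

    known: list of (feature_vector, price) pairs with price in 1..4.
    """
    if not known:
        raise ValueError("need at least one business with a known price")
    ranked = sorted(
        known,
        key=lambda item: cosine_similarity(target_features, item[0]),
        reverse=True,
    )
    top = ranked[:k]
    average = sum(price for _, price in top) / len(top)
    # Round to a whole price level and keep it inside the 1-4 range.
    return min(4, max(1, round(average)))


# Hypothetical binary attribute vectors with known price levels.
known = [
    ([1, 1, 0, 0], 2),
    ([1, 1, 1, 0], 2),
    ([0, 0, 1, 1], 4),
    ([1, 0, 0, 0], 1),
]
estimated_price = impute_price([1, 1, 0, 0], known, k=2)
print(estimated_price)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a different rounding rule may be preferable if ties are common.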

##### Summary of Business Dataset
1. Filtered the business dataset to keep only hospitality-related businesses (restaurants, clubs, bars, etc.).
2. Removed businesses that were missing both attributes and categories values, as those are essential for feature engineering.
3. Removed multi-unit restaurants such as chain restaurants, as this project is concerned with single-unit businesses.
4. Used categories and attributes data to identify restaurants' characteristics and the type of food/cuisine they serve.
5. Used cosine similarity between businesses to fill in missing price ranges.
6. Removed columns that were not needed, such as address, city, and state.
7. Increased the number of columns from 60 to 144.
8. Adjusted the average star rating so that businesses are compared on the same scale. For instance, a restaurant with a 5-star rating from 2 reviews is not equivalent to a restaurant with 3.5 stars from 100+ reviews.
9. Reduced the business dataset from 209,393 to 44,046 rows.
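The report does not give the formula used for the star rating adjustment. One common way to achieve this kind of correction is a Bayesian (weighted) average that pulls ratings with few reviews toward the global mean. The sketch below assumes that approach; `min_reviews` and the global mean of 3.6 are hypothetical values.

```python
def adjusted_rating(stars, review_count, global_mean, min_reviews=10):
    """Shrink a rating toward the global mean when it has few reviews."""
    weight = review_count / (review_count + min_reviews)
    return weight * stars + (1 - weight) * global_mean


# The two restaurants from the example above, with an assumed global mean.
few_reviews = adjusted_rating(5.0, 2, global_mean=3.6)
many_reviews = adjusted_rating(3.5, 100, global_mean=3.6)
print(round(few_reviews, 2), round(many_reviews, 2))
```

With these assumed values, the 5-star restaurant with 2 reviews lands much closer to the global mean, while the 3.5-star restaurant with 100 reviews barely moves.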

### 2.2 Review Dataset
The review dataframe contains 8,021,122 entries.
- Removed 4,750,990 reviews for businesses that were not in the filtered business dataframe
- Converted the text column from object to string datatype

##### Summary of Review Dataset
1. Reduced the review dataset from 8,021,122 to 3,270,132 rows (59% reduction) using the filtered business dataframe's business_id column.
2. No NaNs found.
3. Added two new columns:
    - review_type: binary value defining whether a review is positive or negative, based on the user's average star rating.
    - text_count: total text count per review
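The filtering step and the text_count column can be sketched in plain Python as below. The sample reviews are made up, and treating text_count as a whitespace-delimited word count is an assumption, since the report does not say whether words or characters are counted.

```python
def filter_reviews(reviews, kept_business_ids):
    """Keep only reviews whose business_id survived the business filtering."""
    kept = set(kept_business_ids)  # Set membership keeps each lookup cheap.
    return [r for r in reviews if r["business_id"] in kept]


def add_text_count(review):
    # Assumption: text_count is the number of whitespace-separated words.
    return {**review, "text_count": len(review["text"].split())}


# Hypothetical reviews; only b1 and b3 remain in the filtered business data.
reviews = [
    {"business_id": "b1", "text": "Great tacos and friendly staff"},
    {"business_id": "b2", "text": "Slow service"},
    {"business_id": "b3", "text": "Would come back"},
]
filtered = [add_text_count(r) for r in filter_reviews(reviews, {"b1", "b3"})]
reduction = 1 - len(filtered) / len(reviews)
print(len(filtered), f"{reduction:.0%}")
```

The same membership filter applies to the tip and check-in datasets, keyed on user_id or business_id as appropriate.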

### 2.3 User and Tip Dataset
The user dataframe contains 1,968,703 entries for users who reviewed businesses across the 10 metropolitan areas. For this classification project, the user dataframe is not needed as a feature source, since its columns do not add value in predicting whether a restaurant will close. Instead, the user dataframe was filtered to remove users not present in the filtered review dataframe, and the filtered user dataframe was then used to filter the tips dataframe.

Unlike the review dataframe, the tips dataframe does not have any quantifiable value indicating whether a tip is a good or bad comment about the restaurant. Therefore, sentiment analysis was used to label each tip as positive (1) or negative (0) based on its compound score.

**Note**

The compound score is the sum of all lexicon ratings, normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).

- positive sentiment : (compound score >= 0.05)
- neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
- negative sentiment : (compound score <= -0.05)
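The thresholds above translate directly into a small labeling function. This sketch assumes the compound score has already been computed by a sentiment analyzer; `sentiment_label` is a hypothetical helper, and how neutral tips map onto the binary positive (1) / negative (0) column is not specified in this report.

```python
def sentiment_label(compound):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"  # Strictly between -0.05 and 0.05.


# Hypothetical compound scores for a handful of tips.
scores = [0.62, 0.05, 0.0, -0.049, -0.05, -0.8]
labels = [sentiment_label(s) for s in scores]
print(labels)
```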

##### Summary of Tip Dataset
1. Converted the tip text column datatype to string.
2. Filtered the tip dataset using the updated user dataset's user_id.
3. Removed all columns except user_id, text, date, and the sentiment analysis columns.
4. Reduced the tip dataset from 1,320,761 to 1,136,880 rows (14% reduction).
5. Removed two rows that had null values.

### 2.4 Check-In Dataset
The checkin dataframe contains 175,187 entries, the fewest of the datasets described above.

##### Summary of Check-In Dataset
1. Filtered the checkin dataset using the filtered review dataframe.
2. Reduced the checkin dataset from 175,187 to 42,296 rows.
3. No null values were found.