# Yelp Dataset Sanity Checks & Preview

This notebook runs quick sanity checks on the raw Yelp JSON datasets and previews their contents before full ETL processing. It includes:

1. **Row Counts**
   Fast newline-based counts for each JSON file to confirm expected data volume.

2. **Review Date Range**
   Chunked parsing of the review file to identify the earliest and latest review timestamps.

3. **Schema Preview**
   Top‑n row display for each dataset (`business`, `user`, `review`, `tip`, `checkin`) to verify field names and sample values.

4. **Missing Value Analysis**
   Quick scan of the fraction of missing (NaN) values per column, computed on the first 100,000 rows of the `business`, `user`, and `review` files.

5. **State Code Validation**
   Regex check that flags any `state` value that is not exactly two uppercase letters.

6. **Rating & Date Integrity**
   Validation, on the first 500,000 reviews, that review stars fall within 1–5 and dates parse correctly.


In [1]:
# 0 | Imports & constants
from pathlib import Path
import pandas as pd

RAW = Path("../data/raw")
assert RAW.exists(), "`data/raw` folder not found – check project structure."

In [2]:
# 1 | File row counts (sanity check)
# Quickly count newline characters – inexpensive and avoids loading full JSON.
def count_lines(path: Path, buf_size: int = 1024 * 1024) -> int:
    n = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(buf_size), b""):
            n += chunk.count(b"\n")
    return n

FILES = [
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_user.json",
    "yelp_academic_dataset_review.json",
    "yelp_academic_dataset_tip.json",
    "yelp_academic_dataset_checkin.json",
]

row_counts = {fname: count_lines(RAW / fname) for fname in FILES}
row_counts

{'yelp_academic_dataset_business.json': 150346,
 'yelp_academic_dataset_user.json': 1987897,
 'yelp_academic_dataset_review.json': 6990280,
 'yelp_academic_dataset_tip.json': 908915,
 'yelp_academic_dataset_checkin.json': 131930}

In [4]:
# 2 | Review date range & project anchor date
min_d, max_d = None, None
for chunk in pd.read_json(
    RAW / "yelp_academic_dataset_review.json",
    lines=True,
    chunksize=200_000,
):
    # After reading, extract the 'date' column and convert it to datetime
    dates = pd.to_datetime(chunk["date"])
    dmin, dmax = dates.min(), dates.max()
    min_d = dmin if min_d is None else min(min_d, dmin)
    max_d = dmax if max_d is None else max(max_d, dmax)

print("Review dates range:", min_d, "to", max_d)

Review dates range: 2005-02-16 03:23:22 to 2022-01-19 19:48:45
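
As a lighter cross-check, the same range can be found without pandas. Timestamps in the fixed-width `YYYY-MM-DD HH:MM:SS` form shown in the output above sort lexicographically in chronological order, so plain string comparison is enough. This is a minimal sketch, not the notebook's method: `date_range` is a hypothetical helper, and it assumes every record carries a `date` field in exactly that format.

```python
import json
from typing import Iterable, Optional, Tuple


def date_range(
    lines: Iterable[str], field: str = "date"
) -> Tuple[Optional[str], Optional[str]]:
    """Return (earliest, latest) timestamp strings from JSON-lines input.

    Assumes fixed-width 'YYYY-MM-DD HH:MM:SS' strings, which compare
    chronologically as plain strings.
    """
    lo: Optional[str] = None
    hi: Optional[str] = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        value = json.loads(line)[field]
        if lo is None or value < lo:
            lo = value
        if hi is None or value > hi:
            hi = value
    return lo, hi


# Illustrative records using dates from the review preview.
sample = [
    '{"stars": 3, "date": "2018-07-07 22:09:11"}',
    '{"stars": 5, "date": "2012-01-03 15:28:18"}',
    '{"stars": 3, "date": "2014-02-05 20:30:30"}',
]
print(date_range(sample))
```

On the real file, an open text handle can be passed directly, for example `date_range(f)` inside `with (RAW / "yelp_academic_dataset_review.json").open(encoding="utf-8") as f:` (UTF-8 is an assumption here).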


In [6]:
# 3 | Quick head preview for each file
from IPython.display import display

def preview(fname: str, n: int = 3):
    df = pd.read_json(RAW / fname, lines=True, nrows=n)
    print(f"\n=== {fname} (top {n}) ===")
    display(df)

for f in FILES:
    preview(f)


=== yelp_academic_dataset_business.json (top 3) ===


,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."



=== yelp_academic_dataset_user.json (top 3) ===


,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,...,65,55,56,18,232,844,467,467,239,180
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,...,264,184,157,251,1847,7054,3131,3131,1521,1946
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,"2009,2010,2011,2012,2013","LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,...,13,10,17,3,66,96,119,119,35,18



=== yelp_academic_dataset_review.json (top 3) ===


,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30



=== yelp_academic_dataset_tip.json (top 3) ===


,user_id,business_id,text,date,compliment_count
0,AGNUgVwnZUey3gcPCJ76iw,3uLgwr0qeCNMjKenHJwPGQ,Avengers time with the ladies.,2012-05-18 02:17:21,0
1,NBN4MgHP9D3cw--SnauTkA,QoezRbYQncpRqyrLH6Iqjg,They have lots of good deserts and tasty cuban...,2013-02-05 18:35:10,0
2,-copOvldyKh1qr-vzkDEvw,MYoRNLb5chwjQe3c_k37Gg,It's open even when you think it isn't,2013-08-18 00:56:08,0



=== yelp_academic_dataset_checkin.json (top 3) ===


,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
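
The visual preview can be backed by an automated check that compares the keys of a file's first record against the columns seen above. The sketch below is one possible approach, not part of the notebook: `EXPECTED_COLUMNS` and `schema_diff` are hypothetical names, the column sets are copied from the previews, and `user` is left out because its preview elides some columns.

```python
import json
from typing import Dict, Set

# Column names copied from the previews above (assumed to be complete).
EXPECTED_COLUMNS: Dict[str, Set[str]] = {
    "business": {
        "business_id", "name", "address", "city", "state", "postal_code",
        "latitude", "longitude", "stars", "review_count", "is_open",
        "attributes", "categories", "hours",
    },
    "review": {
        "review_id", "user_id", "business_id", "stars", "useful",
        "funny", "cool", "text", "date",
    },
    "tip": {"user_id", "business_id", "text", "date", "compliment_count"},
    "checkin": {"business_id", "date"},
}


def schema_diff(first_line: str, expected: Set[str]) -> Dict[str, Set[str]]:
    """Compare the keys of one JSON record against an expected column set."""
    actual = set(json.loads(first_line))
    return {"missing": expected - actual, "unexpected": actual - expected}


# First tip record from the preview, rebuilt as a JSON line.
sample_tip = (
    '{"user_id": "AGNUgVwnZUey3gcPCJ76iw", '
    '"business_id": "3uLgwr0qeCNMjKenHJwPGQ", '
    '"text": "Avengers time with the ladies.", '
    '"date": "2012-05-18 02:17:21", "compliment_count": 0}'
)
print(schema_diff(sample_tip, EXPECTED_COLUMNS["tip"]))
```

Checking only the first record is cheap but weak; records further down the file could still omit optional keys.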


In [10]:
# 4 | Quick scan of missing-value fractions (first 100,000 rows per file; 0.15 means 15%)
for fname in FILES[:3]:
    df = pd.read_json(RAW/fname, lines=True, nrows=100_000)
    na_pct = df.isna().mean().sort_values(ascending=False)[:10]
    print(f"\n{fname}  ——  TOP NaN%")
    print(na_pct.to_string())


yelp_academic_dataset_business.json  ——  TOP NaN%
hours          0.15436
attributes     0.09085
categories     0.00069
address        0.00000
business_id    0.00000
name           0.00000
postal_code    0.00000
state          0.00000
city           0.00000
latitude       0.00000

yelp_academic_dataset_user.json  ——  TOP NaN%
user_id          0.0
name             0.0
review_count     0.0
yelping_since    0.0
useful           0.0
funny            0.0
cool             0.0
elite            0.0
friends          0.0
fans             0.0

yelp_academic_dataset_review.json  ——  TOP NaN%
review_id      0.0
user_id        0.0
business_id    0.0
stars          0.0
useful         0.0
funny          0.0
cool           0.0
text           0.0
date           0.0
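
Note that `isna()` counts only null values, so a column of empty strings reports 0.0 missing. The sketch below shows one way to count empty or whitespace-only strings as blank too, using only the standard library. `blank_fractions` is a hypothetical helper and the sample records are made up for illustration; whether the Yelp files actually contain empty strings is not established by the output above.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def blank_fractions(lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of records per field whose value is null or a blank string."""
    total = 0
    blanks: Counter = Counter()
    fields = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        for key, value in record.items():
            fields.add(key)
            if value is None or (isinstance(value, str) and not value.strip()):
                blanks[key] += 1
    if total == 0:
        return {}
    return {key: blanks[key] / total for key in sorted(fields)}


# Made-up records for illustration only.
sample = [
    '{"name": "A", "address": "1 Main St", "hours": null}',
    '{"name": "B", "address": "", "hours": {"Monday": "8:0-22:0"}}',
    '{"name": "C", "address": "   ", "hours": null}',
    '{"name": "D", "address": "2 Oak Ave", "hours": {"Monday": "0:0-0:0"}}',
]
print(blank_fractions(sample))
```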


In [11]:
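# 5 | Check for non-standard state abbreviations
# Note: nrows=150_000 stops short of the 150,346 rows counted in step 1, so the last 346 businesses are not checked.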
biz_full = pd.read_json(RAW/"yelp_academic_dataset_business.json", lines=True, nrows=150_000)
biz = biz_full[["city", "state"]]
invalid_state = biz.loc[~biz["state"].str.match(r"^[A-Z]{2}$", na=False), "state"].unique()
print("⚠️ Non-standard state abbreviations:", invalid_state[:10])

⚠️ Non-standard state abbreviations: ['XMS']
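
The regex only checks the shape of the code: any two uppercase letters pass, whether or not they name a real state. A stricter option is an explicit allowlist. The sketch below is an assumption-laden example rather than the notebook's method: `US_STATE_CODES` (the 50 US states plus DC) and `unexpected_states` are hypothetical names, and a dataset covering territories or other countries would need a longer list.

```python
from typing import Iterable, List

# Two-letter postal codes for the 50 US states plus DC (assumed scope).
US_STATE_CODES = frozenset(
    """
    AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
    MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
    SD TN TX UT VT VA WA WV WI WY DC
    """.split()
)


def unexpected_states(
    values: Iterable[object], allowed: frozenset = US_STATE_CODES
) -> List[str]:
    """Return the sorted, de-duplicated values that are not in the allowlist."""
    return sorted({str(v) for v in values if v not in allowed})


print(unexpected_states(["CA", "MO", "AZ", "XMS", "ZZ"]))
```

With a DataFrame, the same helper could be called as `unexpected_states(biz["state"].unique())`.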


In [12]:
# 6 | Check review stars and dates (first 500,000 reviews only)
rev_full = pd.read_json(RAW/"yelp_academic_dataset_review.json", lines=True, nrows=500_000)
rev = rev_full[["stars", "date"]].copy()
# errors="coerce" turns unparseable dates into NaT, so check for NaT explicitly.
rev["date"] = pd.to_datetime(rev["date"], errors="coerce")
assert rev["stars"].between(1, 5).all(), "Invalid star rating found!"
assert rev["date"].notna().all(), "Unparseable review date found!"
print("Review date range:", rev["date"].min(), "to", rev["date"].max())

Review date range: 2005-03-01 17:47:15 to 2022-01-19 00:51:23
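
When a check fails, it helps to know which records are at fault rather than only that something is wrong. The sketch below is a hypothetical record-level validator using only the standard library. `invalid_reviews` and `DATE_FORMAT` are illustrative names, the format string is inferred from the timestamps shown above, and the sample records are made up.

```python
import json
from datetime import datetime
from typing import Iterable, List, Tuple

# Timestamp layout inferred from the previewed data (an assumption).
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def invalid_reviews(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for out-of-range stars or bad dates."""
    problems: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        stars = record.get("stars")
        date = record.get("date")
        # Reject non-numeric values (including booleans) and values outside 1..5.
        is_number = isinstance(stars, (int, float)) and not isinstance(stars, bool)
        if not is_number or not 1 <= stars <= 5:
            problems.append((lineno, f"stars={stars!r}"))
        try:
            datetime.strptime(str(date), DATE_FORMAT)
        except ValueError:
            problems.append((lineno, f"date={date!r}"))
    return problems


# Made-up records: line 2 has a bad rating, line 3 an impossible date.
sample = [
    '{"stars": 5, "date": "2012-01-03 15:28:18"}',
    '{"stars": 6, "date": "2014-02-05 20:30:30"}',
    '{"stars": 3, "date": "2014-02-30 20:30:30"}',
]
print(invalid_reviews(sample))
```

Unlike `errors="coerce"`, which silently maps bad dates to `NaT`, this reports each offending line number.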
