# Part 1: Data Sampling

The Yelp Reviews dataset is large (~9 GB) and stored as JSON, which leads to high memory usage when it is loaded directly. To make the dataset more manageable, we will focus on the `reviews` subset and reduce it to approximately 1.5 million reviews. This reduction retains a good portion of the data while keeping memory and computation demands within limits.

Fine-tuning BERT on medium-sized datasets such as this can lead to significant performance improvements. Reducing the dataset ensures it fits within system constraints, and 1.5 million reviews provides a sufficient sample size for building a content-based recommender system capable of delivering meaningful insights.

## Importing Dependencies

Since the `reviews` dataset is large, loading it directly using pandas can result in memory exhaustion, as pandas loads the entire dataset into memory at once. To handle the dataset more efficiently, we will use **Dask**, a parallel computing library that supports lazy evaluation. Dask is designed to manage larger datasets by loading data as needed, thus minimizing memory usage.

In [1]:
import pandas as pd
import dask.dataframe as dd

## Loading Dataset

We use `gdown`, a Python command-line tool for downloading files and folders from Google Drive.

In [2]:
!pip install -q gdown                                                             # Install gdown in quiet mode, suppressing logs

!gdown --folder 1j8RLxZYJfonGETCwlw1fUl1zaafAqtse -O /content/YelpDataset         # Download the folder with the id and save the
                                                                                  # results to /content/YelpDataset

Retrieving folder contents
Processing file 1PgywSzvyZ-LodDZReL9cwk3EoEfSBtAZ Dataset_User_Agreement.pdf
Processing file 167HeuRH581eFsso0PtmUyWoT1pHp4kUI yelp_academic_dataset_business.json
Processing file 16i_3i2jwOfJ0MStpq3uBnNPoFjkLRrt5 yelp_academic_dataset_checkin.json
Processing file 1ADMBGRjbkEfkRbqLCVisfs63uGh9c91o yelp_academic_dataset_review.json
Processing file 1g6WBceed7ntCNgAoV5hjnkulqkVBHDIJ yelp_academic_dataset_tip.json
Processing file 1tmKspFRt8vs3v0AFz-KcbEY9p1wrFSUP yelp_academic_dataset_user.json
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1PgywSzvyZ-LodDZReL9cwk3EoEfSBtAZ
To: /content/YelpDataset/Dataset_User_Agreement.pdf
100% 80.4k/80.4k [00:00<00:00, 76.9MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=167HeuRH581eFsso0PtmUyWoT1pHp4kUI
From (redirected): https://drive.google.com/uc?id=167HeuRH581eFsso0PtmUyWoT1pHp4kUI&confirm=t&uui

## Chunking Dataset

To reduce memory overhead, we define a `chunk_size` variable and read the JSON file in smaller parts rather than all at once. Each chunk will then be written to **parquet**, a columnar format that is more compact and efficient than JSON. Parquet reduces the overall size of the dataset while preserving its column structure.

In [3]:
chunk_size = 10_000
parquet_file = '/content/Dataset/yelp_reviews.parquet'

# The 'lines=True' argument ensures that each line in the file is treated as a separate JSON object

chunks = pd.read_json('/content/YelpDataset/yelp_academic_dataset_review.json', lines=True, chunksize=chunk_size)
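
To make the chunked reading concrete, here is a minimal plain-Python sketch of what reading a JSON Lines file in fixed-size chunks amounts to. It is only an illustration of the idea, not how pandas implements `lines=True` and `chunksize`; the helper `iter_json_chunks` and the toy records in `sample_lines` are hypothetical names introduced for this example.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_json_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size parsed records from JSON Lines input."""
    # Parse lazily: each non-empty line is one standalone JSON object
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical miniature of a JSON Lines file: one review object per line
sample_lines = [
    json.dumps({"review_id": f"r{i}", "stars": float(i % 5 + 1)})
    for i in range(25)
]

chunk_lengths = [len(chunk) for chunk in iter_json_chunks(sample_lines, chunk_size=10)]
print(chunk_lengths)
```

Only one chunk is materialized at a time, which is why peak memory depends on `chunk_size` rather than on the size of the file.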


We will convert the dataset to parquet by writing the first chunk and appending each subsequent chunk to the same file. This keeps memory usage low because only one chunk is held in memory at a time. The `append` option used here is provided by the `fastparquet` engine, so we install it first. We also create the folder in which the parquet file will be saved.

In [4]:
!pip install -q fastparquet


In [5]:
!mkdir -p /content/Dataset                                                        # -p avoids an error if the folder already exists

In [6]:
for i, chunk in enumerate(chunks):
    chunk.to_parquet(parquet_file, engine='fastparquet', index=False, compression='snappy', append=(i > 0))
    print(f"Chunk {i} appended.")

Chunk 0 appended.
Chunk 1 appended.
Chunk 2 appended.
Chunk 3 appended.
Chunk 4 appended.
Chunk 5 appended.
Chunk 6 appended.
Chunk 7 appended.
Chunk 8 appended.
Chunk 9 appended.
Chunk 10 appended.
Chunk 11 appended.
Chunk 12 appended.
Chunk 13 appended.
Chunk 14 appended.
Chunk 15 appended.
Chunk 16 appended.
Chunk 17 appended.
Chunk 18 appended.
Chunk 19 appended.
Chunk 20 appended.
Chunk 21 appended.
Chunk 22 appended.
Chunk 23 appended.
Chunk 24 appended.
Chunk 25 appended.
Chunk 26 appended.
Chunk 27 appended.
Chunk 28 appended.
Chunk 29 appended.
Chunk 30 appended.
Chunk 31 appended.
Chunk 32 appended.
Chunk 33 appended.
Chunk 34 appended.
Chunk 35 appended.
Chunk 36 appended.
Chunk 37 appended.
Chunk 38 appended.
Chunk 39 appended.
Chunk 40 appended.
Chunk 41 appended.
Chunk 42 appended.
Chunk 43 appended.
Chunk 44 appended.
Chunk 45 appended.
Chunk 46 appended.
Chunk 47 appended.
Chunk 48 appended.
Chunk 49 appended.
Chunk 50 appended.
Chunk 51 appended.
Chunk 52 appended.
Chu

Once the dataset is converted to parquet, we can load it using Dask for further processing.

In [7]:
df = dd.read_parquet(parquet_file)

## Displaying Shape

Before performing any sampling, we inspect the shape of the dataset. Initially, the dataset contains just under 7 million rows (6,990,280) and 9 columns.

In [8]:
print(f"Shape: ({len(df)}, {len(df.columns)})")

Shape: (6990280, 9)
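
As a quick sanity check on these numbers, the short sketch below works out how many chunks the conversion loop should have produced and how large the 1.5 million row target is relative to the full dataset. The variable names are chosen for this illustration; the values are the row count printed above, the `chunk_size` set earlier, and the sampling target from the introduction.

```python
import math

total_rows = 6_990_280    # row count printed above
chunk_size = 10_000       # value used when converting to parquet
target_rows = 1_500_000   # overall sampling target

# Number of chunks written by the conversion loop (indices 0 .. n_chunks - 1)
n_chunks = math.ceil(total_rows / chunk_size)

# The final chunk holds whatever is left after the full-size chunks
last_chunk_rows = total_rows - (n_chunks - 1) * chunk_size

# Share of the full dataset that the sample will keep
overall_fraction = target_rows / total_rows

print(f"{n_chunks} chunks, last chunk holds {last_chunk_rows} rows")
print(f"Target sample is {overall_fraction:.1%} of the full dataset")
```

With these values the loop writes 700 chunks (indices 0 to 699), the last of which holds only 280 rows, and the final sample keeps roughly a fifth of the reviews.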


## Mapping Stars to Sentiments

We can map the star ratings into sentiments as follows:

- **Negative**: Ratings of 1 or 2 stars (1 ≤ stars ≤ 2)

- **Neutral**: Ratings equal to 3 stars (stars = 3)

- **Positive**: Ratings between 4 and 5 stars (4 ≤ stars ≤ 5)
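
The rule above (1 to 2 stars negative, 3 stars neutral, 4 to 5 stars positive) can be sketched in plain Python before applying it to the Dask dataframe. The names `SENTIMENT_BY_STARS` and `stars_to_sentiment` are hypothetical helpers for this illustration only.

```python
from typing import Optional

# Lookup table for the three sentiment classes
SENTIMENT_BY_STARS = {
    1: "negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "positive",
}


def stars_to_sentiment(stars: float) -> Optional[str]:
    """Return the sentiment label for a star rating, or None if it is not 1-5."""
    # Integer keys also match whole-number floats such as 4.0, because 4 == 4.0
    return SENTIMENT_BY_STARS.get(stars)


labels = [stars_to_sentiment(s) for s in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(labels)
```

Ratings outside the table return `None` here, so unexpected values are not silently assigned to a class.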

In [9]:
def map_sentiment(stars):
    return stars.map({1.0: "negative", 2.0: "negative", 3.0: "neutral", 4.0: "positive", 5.0: "positive"})

# meta tells Dask the name and dtype to expect for the new column without computing it
df["sentiment"] = df["stars"].map_partitions(map_sentiment, meta=("sentiment", "category"))

df.head(10)


| | review_id | user_id | business_id | stars | useful | funny | cool | text | date | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | neutral |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 | positive |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 | neutral |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 | positive |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 | positive |
| 5 | JrIxlS1TzJ-iCu79ul40cQ | eUta8W_HdHMXPzLBBZhL1A | 04UD14gamNjLY0IDYVhHJg | 1 | 1 | 2 | 1 | I am a long term frequent customer of this est... | 2015-09-23 23:10:31 | negative |
| 6 | 6AxgBCNX_PNTOxmbRSwcKQ | r3zeYsv1XFBRA4dJpL78cw | gmjsEdUsKpj9Xxu6pdjH0g | 5 | 0 | 2 | 0 | Loved this tour! I grabbed a groupon and the p... | 2015-01-03 23:21:18 | positive |
| 7 | _ZeMknuYdlQcUqng_Im3yg | yfFzsLmaWF2d4Sr0UNbBgg | LHSTtnW3YHCeUkRDGyJOyw | 5 | 2 | 0 | 0 | Amazingly amazing wings and homemade bleu chee... | 2015-08-07 02:29:16 | positive |
| 8 | ZKvDG2sBvHVdF5oBNUOpAQ | wSTuiTk-sKNdcFyprzZAjg | B5XSoSG3SfvQGtKEGQ1tSQ | 3 | 1 | 1 | 0 | This easter instead of going to Lopez Lake we ... | 2016-03-30 22:46:33 | neutral |
| 9 | pUycOfUwM8vqX7KjRRhUEA | 59MxRhNVhU9MYndMkz0wtw | gebiRewfieSdtt17PTW6Zg | 3 | 0 | 0 | 0 | Had a party of 6 here for hibachi. Our waitres... | 2016-07-25 07:31:06 | neutral |


## Displaying Dataset

We display the dataset and its shape again to confirm that no rows were lost or corrupted during conversion and that the new `sentiment` column is present, giving 10 columns instead of 9.

In [10]:
print(f"{len(df), len(df.columns)}")
df.head(10)

(6990280, 10)


| | review_id | user_id | business_id | stars | useful | funny | cool | text | date | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | neutral |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 | positive |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 | neutral |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 | positive |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 | positive |
| 5 | JrIxlS1TzJ-iCu79ul40cQ | eUta8W_HdHMXPzLBBZhL1A | 04UD14gamNjLY0IDYVhHJg | 1 | 1 | 2 | 1 | I am a long term frequent customer of this est... | 2015-09-23 23:10:31 | negative |
| 6 | 6AxgBCNX_PNTOxmbRSwcKQ | r3zeYsv1XFBRA4dJpL78cw | gmjsEdUsKpj9Xxu6pdjH0g | 5 | 0 | 2 | 0 | Loved this tour! I grabbed a groupon and the p... | 2015-01-03 23:21:18 | positive |
| 7 | _ZeMknuYdlQcUqng_Im3yg | yfFzsLmaWF2d4Sr0UNbBgg | LHSTtnW3YHCeUkRDGyJOyw | 5 | 2 | 0 | 0 | Amazingly amazing wings and homemade bleu chee... | 2015-08-07 02:29:16 | positive |
| 8 | ZKvDG2sBvHVdF5oBNUOpAQ | wSTuiTk-sKNdcFyprzZAjg | B5XSoSG3SfvQGtKEGQ1tSQ | 3 | 1 | 1 | 0 | This easter instead of going to Lopez Lake we ... | 2016-03-30 22:46:33 | neutral |
| 9 | pUycOfUwM8vqX7KjRRhUEA | 59MxRhNVhU9MYndMkz0wtw | gebiRewfieSdtt17PTW6Zg | 3 | 0 | 0 | 0 | Had a party of 6 here for hibachi. Our waitres... | 2016-07-25 07:31:06 | neutral |


## Sampling Dataset

Our target is 1.5 million rows, sampled at random and balanced across the three sentiments (positive, negative, neutral). We therefore need 500,000 rows for each sentiment.

In [11]:
target_per_class = 500_000
seed = 42                                                           # Fixed seed so the random sample is reproducible
sentiment_counts = df['sentiment'].value_counts().compute()         # Calculates the count for each sentiment in the entire dataset

Next, for each sentiment we compute a sampling fraction by dividing the per-class target by the number of rows in that class. Because fraction-based sampling returns only an approximate row count, we oversample slightly, trim each class to exactly 500,000 rows, and then concatenate the three samples to create `balanced_df`.

In [12]:
sampled_dfs = []

for sentiment in ["negative", "neutral", "positive"]:
    sentiment_df = df[df.sentiment == sentiment]

    # Oversample by 10 rows: Dask's frac-based sampling returns only an approximate row count
    frac = (target_per_class + 10) / sentiment_counts[sentiment]

    # Draw the slightly oversized random sample and bring it into memory as a pandas DataFrame
    sampled_df = sentiment_df.sample(frac=frac, random_state=seed).compute()

    # Trim to exactly target_per_class rows
    sampled_df = sampled_df.sample(n=target_per_class, random_state=seed)

    sampled_dfs.append(sampled_df)

balanced_df = dd.from_pandas(pd.concat(sampled_dfs), npartitions=df.npartitions)


print(balanced_df['sentiment'].value_counts().compute())


sentiment
positive    500000
neutral     500000
negative    500000
Name: count, dtype: int64[pyarrow]
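
The same balancing idea can be illustrated on toy data with the standard library alone. This is a simplified sketch under stated assumptions, not the Dask-based procedure used above: it samples an exact count per class directly instead of oversampling by fraction and trimming, and `balanced_sample`, `toy_rows`, and the class sizes are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(rows, label_of, target_per_class, seed=42):
    """Draw exactly target_per_class rows per label, reproducibly."""
    rng = random.Random(seed)  # fixed seed: same input gives the same sample

    # Group rows by their class label
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < target_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        # Sample without replacement within the class
        sample.extend(rng.sample(group, target_per_class))
    return sample


# Hypothetical imbalanced toy data: 60 positive, 25 negative, 15 neutral rows
toy_rows = (
    [("positive", i) for i in range(60)]
    + [("negative", i) for i in range(25)]
    + [("neutral", i) for i in range(15)]
)

toy_sample = balanced_sample(toy_rows, label_of=lambda r: r[0], target_per_class=10)
toy_counts = Counter(label for label, _ in toy_sample)
print(toy_counts)
```

Each class contributes the same number of rows regardless of how common it is in the input, and calling the function again with the same seed returns the same rows.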


## Saving the Dataset

Lastly, we save `balanced_df` to parquet so it can be used in the preprocessing step.

In [13]:
# engine='pyarrow' uses PyArrow, part of the Apache Arrow project, to write the parquet files.

balanced_df.to_parquet("yelp_reviews_sampled.parquet", engine="pyarrow", write_index=False)