# Introduction to Ray Data 

In this problem, we introduce Ray Data and show how to work with Ray Datasets. We highly encourage you to first go over the [documentation](https://docs.ray.io/en/latest/data/data.html) for Ray Data. 

The dataset we will use for this problem is the Electronics subset of the Amazon Reviews dataset. It has been provided to you in Parquet format at the path ``~/public/pa2_dev``. In the first section, you will use the ``read_parquet`` method to read your Parquet dataset into a ``ray.data.Dataset`` object. Ray Data uses Ray Tasks to read files in parallel. [This](https://docs.ray.io/en/latest/data/data-internals.html) is a useful resource for understanding how data loading works.

# Prereq: place the data in `private` 
For the `ray-notebook` server, JupyterHub runs on the head node, and the worker nodes have access only to the `private` folder. You will therefore need to copy the dataset to your `private` folder with: 

`cp -r ~/public/pa2_dev ~/private` 

In [1]:
import ray
import re
ray.shutdown()
ray.init()

ds = ray.data.read_parquet("~/private/pa2_dev")

2024-03-04 11:39:19,088	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.8.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2024-03-04 11:39:19,096	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.47.192.23:6380...
2024-03-04 11:39:19,137	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.47.192.23:8265


Metadata Fetch Progress 0:   0%|          | 0/20 [00:00<?, ?it/s]

Parquet Files Sample 0:   0%|          | 0/2 [00:00<?, ?it/s]

In [2]:
#Print out the schema of the dataset
ds.schema()

Column          Type
------          ----
reviewTime      string
reviewerName    string
summary         string
unixReviewTime  int64
asin            string
reviewText      string
reviewerID      string
verified        bool
overall         double

# What is `num_blocks` here?
Go through the documentation linked at the top to find out!

In [3]:
ds.num_blocks()

120

In [4]:
#view the first 5 entries using the Dataset.take() function 
# YOUR CODE HERE
ds.take(5)

2024-03-04 11:39:23,347	INFO dataset.py:2488 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2024-03-04 11:39:23,361	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-04 11:39:23,361	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 2 smaller blocks.
2024-03-04 11:39:23,362	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> LimitOperator[limit=5]
2024-03-04 11:39:23,363	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024

Running 0:   0%|          | 0/120 [00:00<?, ?it/s]

[{'reviewTime': '02 5, 2018',
  'reviewerName': 'Barry',
  'summary': 'Great radio with one BIG flaw!',
  'unixReviewTime': 1517788800,
  'asin': 'B0012YFY54',
  'reviewText': "I have another Sangean radio which love, so I was excited to get this one to take camping, but I soon discovered that I couldn't pack it in anything without it turning on 'by itself' due to the fact that the on button is to vulnerable ant pressure. It should have been recessed below the surface or sliding switch or other. I decoded to try packing it in its original styrofoam and box, but it still turns on with lightest pressure. In fact it kept turning on even on the drive home. It is a great radio, so I will probably try to make something I can store it in that will prevent this.",
  'reviewerID': 'A1305VN9IRGZI3',
  'verified': True,
  'overall': 3.0},
 {'reviewTime': '07 19, 2015',
  'reviewerName': 'arl6969',
  'summary': 'I like this product cause you can put alot of things ...',
  'unixReviewTime': 1437264

# Adding a column 
To add a column to a Ray Dataset, we use the ``Dataset.add_column()`` method, documentation for which can be found [here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.add_column.html)
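One subtlety is worth keeping in mind: the function passed to `add_column` is called on one batch at a time, so a counter such as `range(len(...))` restarts in every batch. The plain-Python sketch below does not use Ray at all; `batches`, `local_ids`, and `global_ids` are hypothetical names used only to contrast per-batch numbering with numbering that carries a running offset.

```python
# Plain-Python illustration; lists of rows stand in for the batches Ray would pass.
batches = [["a", "b", "c"], ["d", "e"], ["f"]]

# Per-batch numbering: each batch is numbered independently, so ids repeat.
local_ids = [list(range(len(batch))) for batch in batches]

# Global numbering: carry an offset equal to the number of rows seen so far.
global_ids = []
offset = 0
for batch in batches:
    global_ids.append(list(range(offset, offset + len(batch))))
    offset += len(batch)

print(local_ids)   # [[0, 1, 2], [0, 1], [0]]
print(global_ids)  # [[0, 1, 2], [3, 4], [5]]
```

In a distributed run the batches are processed in parallel, so a simple running offset like this is not directly available; treat it as a sketch of the idea rather than a recipe.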

In [5]:
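#add a column called id to your dataset, where we number each of our entries from 0 to ds.count() - 1
# YOUR CODE HERE
import pandas as pd

def add_id(batch):
    # add_column calls this once per pandas batch and uses the returned values as the new column.
    # Caveat: range(len(batch)) restarts at 0 in every batch, so these ids are not globally unique.
    return pd.Series(range(len(batch)), index=batch.index)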

ds = ds.add_column('id', add_id)
ds = ds.materialize() #why did we do this? Read the cell below. 
ds.take(5)

2024-03-04 11:39:24,132	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-04 11:39:24,133	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 2 smaller blocks.
2024-03-04 11:39:24,134	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(process_batch)]
2024-03-04 11:39:24,135	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:24,135	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_curren

Running 0:   0%|          | 0/120 [00:00<?, ?it/s]

2024-03-04 11:39:42,031	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> LimitOperator[limit=5]
2024-03-04 11:39:42,032	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:42,033	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

[{'reviewTime': '09 18, 1999',
  'reviewerName': 'D. C. Carrad',
  'summary': 'A star is born',
  'unixReviewTime': 937612800,
  'asin': '0151004714',
  'reviewText': 'This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers\' workshop "my parents were  mean to me and then my professors were mean to me" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl\'s coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author\'s grasp of the meaning and texture of the lost world of  French Algeria in the 1950\'s and \'60\'s...particularly poignant when read in  1999 from another ruine

# Lazy Execution in Ray 

As you may have noticed, we added a ``ds.materialize()`` call in the cell above. We do this because Ray Data executes lazily and in a streaming fashion by default: transformations are not run until their results are consumed. You should read more about it [here](https://docs.ray.io/en/latest/data/data-internals.html#execution). We call ``materialize`` here to execute the ``add_column`` transformation on the entire dataset. 
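As a rough analogy only (this is plain Python, not Ray's executor), a generator behaves in a similarly lazy, streaming way: building the pipeline does no work, pulling one item does just enough work, and consuming everything is comparable to materializing. The names `add_one`, `calls`, and `pipeline` are hypothetical.

```python
calls = []

def add_one(x):
    calls.append(x)  # record that the transformation actually ran
    return x + 1

data = [1, 2, 3, 4]

# Building the pipeline does no work yet (lazy).
pipeline = (add_one(x) for x in data)
ran_before = len(calls)

# Pulling one item runs only as much as needed (streaming).
first = next(pipeline)
ran_after_first = len(calls)

# Consuming everything is the analogue of materializing the full result.
rest = list(pipeline)

print(ran_before, ran_after_first, first, rest)  # 0 1 2 [3, 4, 5]
```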

# Compute Statistics
Just like in pandas, we can compute statistics on our data using built-in methods such as mean, min, and max on the columns of our Dataset.

In [6]:
#Calculate the mean, max, and min of the overall rating using built-in Dataset methods. 

# YOUR CODE HERE
mean_rating = ds.mean("overall")
max_rating = ds.max("overall")  # named *_rating to avoid shadowing Python's built-in max() and min()
min_rating = ds.min("overall")

print("mean of overall rating: ", mean_rating)
print("max overall rating: ", max_rating)
print("min overall rating: ", min_rating)

# raise NotImplementedError()

2024-03-04 11:39:42,178	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-04 11:39:42,179	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:42,180	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/240 [00:00<?, ?it/s]

Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

2024-03-04 11:39:44,688	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-04 11:39:44,689	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:44,690	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/240 [00:00<?, ?it/s]

Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

2024-03-04 11:39:47,004	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-04 11:39:47,005	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:47,006	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/240 [00:00<?, ?it/s]

Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

mean of overall rating:  4.267620789793895
max overall rating:  5.0
min overall rating:  1.0


# Preprocessors in Ray Data

Ray Data is part of the Ray AI Runtime and is built to be a scalable data processing library for ML applications. Hence, it ships with a library of common preprocessors that are often needed when serving ML models. [Here](https://docs.ray.io/en/latest/data/preprocessors.html#data-preprocessors) is how the built-in preprocessors work.
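To make the arithmetic behind max-abs scaling concrete, here is a minimal plain-Python sketch that divides each value by the largest absolute value in the column. It is not Ray's implementation; `max_abs_scale` is a hypothetical helper, and its handling of an all-zero column is an assumption made for the sketch.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the column."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)  # assumption: leave an all-zero column unchanged
    return [v / max_abs for v in values]

ratings = [1.0, 3.0, 5.0, 4.0]
scaled = max_abs_scale(ratings)
print(scaled)  # [0.2, 0.6, 1.0, 0.8]
```

With ratings between 1.0 and 5.0, the scaled values fall between 0.2 and 1.0.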

In [7]:
#Scale each 'overall' value by the column's maximum absolute value using the MaxAbsScaler

# YOUR CODE HERE
from ray.data.preprocessors import MaxAbsScaler
preprocessor = MaxAbsScaler(columns=["overall"])
ds = preprocessor.fit_transform(ds)
# raise NotImplementedError()

ds.take(5)

2024-03-04 11:39:49,627	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-04 11:39:49,629	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:39:49,630	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


- Aggregate 1:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Map 2:   0%|          | 0/240 [00:00<?, ?it/s]

Shuffle Reduce 3:   0%|          | 0/240 [00:00<?, ?it/s]

Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

2024-03-04 11:46:57,269	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(MaxAbsScaler._transform_pandas)] -> LimitOperator[limit=5]
2024-03-04 11:46:57,270	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:46:57,270	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

[{'reviewTime': '08 21, 2015',
  'reviewerName': 'Timothy Fenton',
  'summary': 'Bigger, heavier & bulkier than expected. "Travel" since ...',
  'unixReviewTime': 1440115200,
  'asin': 'B0015DYMVO',
  'reviewText': 'Bigger, heavier & bulkier than expected.  "Travel" since it\'s smaller than a regular surge protector, but not really all that mini.  Not pictured, there\'s a the plug cover seems to be intended to be saved... and there\'s a little slot at the bottom of the plug to stick it in when the surge protector is plugged in.  It\'s surely going to be lost on my first trip, but not a big deal.  But if they want you to keep it, should be somehow attached or have a more logical storage spot.\n\nThe button you push to swivel the plug 360 degrees is tough to push in...  Overall, meh.',
  'reviewerID': 'AEC8L8URETQ3K',
  'verified': True,
  'overall': 0.6,
  'id': 0},
 {'reviewTime': '08 20, 2015',
  'reviewerName': 'Jose E Alsina Acosta',
  'summary': 'Five Stars',
  'unixReviewTime': 14

# Applying a transform over the entire dataset

To apply a function to the entire dataset, we use the ``Dataset.map()`` method. It transforms the dataset row by row according to the function you pass into it. ``Dataset.map()`` uses Ray tasks to transform the blocks of the dataset. [Here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map.html#ray.data.Dataset.map) is an example of how to use it.
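The shape of a row-wise function can be sketched without Ray: it receives one row as a dictionary and returns a dictionary. The example below uses a plain list of dicts in place of a Dataset, and `shout` is a hypothetical function chosen only for illustration.

```python
rows = [
    {"summary": "Five Stars", "overall": 5.0},
    {"summary": "Great radio with one BIG flaw!", "overall": 3.0},
]

def shout(row):
    # Receives one row as a dict and returns the modified row.
    row["summary"] = row["summary"].upper()
    return row

# Copy each row first so the input list is left untouched.
mapped = [shout(dict(row)) for row in rows]
print([r["summary"] for r in mapped])
```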

In [8]:
#Create a function named lowercase() that accepts a single row of data as input and 
#converts the text in the 'summary' column of each row to lowercase letters
#Use Dataset.map() to apply this function over the entire dataset. Print the first 5 entries.

# YOUR CODE HERE
def lowercase(row):
    row["summary"] = row['summary'].lower()
    return row

ds = ds.map(lowercase)
# raise NotImplementedError()
ds.take(5)

2024-03-04 11:46:57,934	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(MaxAbsScaler._transform_pandas)->Map(lowercase)] -> LimitOperator[limit=5]
2024-03-04 11:46:57,935	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:46:57,935	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

[{'reviewTime': '02 11, 2009',
  'reviewerName': 'J. Peterson',
  'summary': 'lg lcd 42" 1080',
  'unixReviewTime': 1234310400,
  'asin': 'B0016PCPNS',
  'reviewText': 'Wonderful and easy to set up.  It took me longer to attach the stand than to turn it on, go through the few menus and start watching TV.\n\nSpeakers are not great but still more than ample.\n\nNo problem with side views.  Picture is good from the start up and better with adjustments.\n\nThe CD was helpful for fine tuning it.\n\nI would buy it again and tell my friends to buy it!',
  'reviewerID': 'A15ZD6SCBJEXX1',
  'verified': False,
  'overall': 1.0,
  'id': 0},
 {'reviewTime': '12 31, 2008',
  'reviewerName': 'Amazon Customer',
  'summary': 'surprisingly solid television',
  'unixReviewTime': 1230681600,
  'asin': 'B0016PCPNS',
  'reviewText': "I selected this television from among some bigger name brands in the store because it seemed to have the sharpest picture, which surprised me. I looked at Sony, Samsung, etc, 

# Applying a vectorized transformation over the entire dataset

If your transformation can be vectorized, i.e., applied to multiple rows at once, you can apply it over batches using the ``Dataset.map_batches()`` method. [Here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) is an example of how to use it. Note: set the parameter ``batch_format`` to ``"pandas"`` for this dataset when using ``map_batches()`` to avoid Ray Data-specific issues.
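The idea of batching can be sketched in plain Python: the data is cut into consecutive chunks of at most `batch_size` items, and the function is called once per chunk rather than once per row. `iter_batches` and `double_batch` are hypothetical helpers, not part of Ray's API.

```python
def iter_batches(values, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

def double_batch(batch):
    # One call handles a whole batch instead of a single row.
    return [v * 2 for v in batch]

values = list(range(10))
batches = list(iter_batches(values, batch_size=4))
result = [v for batch in batches for v in double_batch(batch)]

print([len(b) for b in batches])          # [4, 4, 2]
print(result == [v * 2 for v in values])  # True
```

Note that the last batch can be smaller than `batch_size`, so a batch function should not assume a fixed length.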

In [9]:
# Currently, our data features reviews rated on a scale of 0 to 1. 
# To adjust the scale to -1 to 1, develop a method called scale.  
# This method should accept a batch of size 128 and modify its scale from 0-1 to -1 to 1. Print the first 5 entries.

# YOUR CODE HERE
def scale(batch):
    # batch is a pandas DataFrame, so this arithmetic is applied to the whole column at once
    batch["overall"] = batch["overall"] * 2 - 1
    return batch

ds = ds.map_batches(scale, batch_size = 128, batch_format = "pandas")
# raise NotImplementedError()
ds.take(5)

2024-03-04 11:48:44,864	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(MaxAbsScaler._transform_pandas)->Map(lowercase)->MapBatches(scale)] -> LimitOperator[limit=5]
2024-03-04 11:48:44,865	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:48:44,866	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

[{'reviewTime': '01 27, 2005',
  'reviewerName': 'theberad',
  'summary': 'deserves the 5 star rating, reviewing it for what it is',
  'unixReviewTime': 1106784000,
  'asin': 'B0002ZAILY',
  'reviewText': "No, it doesn't have a display. No, you couldn't get at 512 MB player with a display for $99. The Shuffle breaks new ground in flash memory players not only because it is an equisite player (you gotta hold one in your hand and be baffled at the big sound coming out of such a small piece of plastic), but because of what iTunes has done with it. The Autofill function is awesome, and perfect for what the Shuffle is. I didn't want to take my 40 Gig to the gym for fear of dropping it... the shuffle is fantastic for that. Plus, I can listen to audible books on it and quickly get to them by just starting at the beginning in non-shuffle mode.\n\nThis really is just a phenomenal product, and I haven't had any complaints... on the contrary, I'm surprised at everything that I can make it do with

# Cleaning up reviewText

Write a function called ``preprocessor()`` that takes in a batch of size 128. It should convert the ``reviewText`` in each row to lowercase, remove all punctuation (we suggest using regex), and tokenize the result. A GPT-2 tokenizer has been instantiated for you; use the ``tokenizer.encode()`` method to tokenize each `reviewText`. Add these tokenized representations to your dataset under the column ``tokenizedText``. 
(You might have to add the new column beforehand.)

Apply this preprocessor transform using map_batches.
Print the first 5 entries of this transformed dataset.
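The lowercasing and punctuation-removal steps can be tried on a single string before wiring them into a batch function. This is a minimal sketch using only the standard library; `clean_text` is a hypothetical helper, and tokenization is left out. The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace, which means underscores are kept.

```python
import re

def clean_text(text):
    """Lowercase the text and drop characters that are not word characters or whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

sample = "No, it doesn't have a display."
print(clean_text(sample))  # no it doesnt have a display
```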


In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [11]:
# write a `preprocessor` function that can tokenize text for a batch of data. 
# store the result of the map_batches in `transformed`. Use map_batches again.
# YOUR CODE HERE
import re

ds = ds.add_column('tokenizedText', lambda _: None)  # placeholder column, filled in below

def preprocessor(batch):
    # batch is a pandas DataFrame; missing reviews become empty strings
    text = batch['reviewText'].fillna('').astype(str).str.lower()
    # drop every character that is neither a word character nor whitespace
    text = text.apply(lambda x: re.sub(r'[^\w\s]', '', x))
    batch['reviewText'] = text
    batch['tokenizedText'] = text.apply(lambda x: tokenizer.encode(x))
    return batch

transformed = ds.map_batches(preprocessor, batch_size = 128, batch_format = "pandas")
    
# raise NotImplementedError()

transformed_results = transformed.take(5)


2024-03-04 11:50:22,509	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(MaxAbsScaler._transform_pandas)->Map(lowercase)->MapBatches(scale)->MapBatches(process_batch)->MapBatches(preprocessor)] -> LimitOperator[limit=5]
2024-03-04 11:50:22,510	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-04 11:50:22,511	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


Running 0:   0%|          | 0/240 [00:00<?, ?it/s]

(MapBatches(MaxAbsScaler._transform_pandas)->Map(lowercase)->MapBatches(scale)->MapBatches(process_batch)->MapBatches(preprocessor) pid=11074) Token indices sequence length is longer than the specified maximum sequence length for this model (1263 > 1024). Running this sequence through the model will result in indexing errors


In [12]:
transformed_results # inspect results

[{'reviewTime': '01 27, 2005',
  'reviewerName': 'theberad',
  'summary': 'deserves the 5 star rating, reviewing it for what it is',
  'unixReviewTime': 1106784000,
  'asin': 'B0002ZAILY',
  'reviewText': 'no it doesnt have a display no you couldnt get at 512 mb player with a display for 99 the shuffle breaks new ground in flash memory players not only because it is an equisite player you gotta hold one in your hand and be baffled at the big sound coming out of such a small piece of plastic but because of what itunes has done with it the autofill function is awesome and perfect for what the shuffle is i didnt want to take my 40 gig to the gym for fear of dropping it the shuffle is fantastic for that plus i can listen to audible books on it and quickly get to them by just starting at the beginning in nonshuffle mode\n\nthis really is just a phenomenal product and i havent had any complaints on the contrary im surprised at everything that i can make it do without the screen seriously i d

In [13]:
#check if you have tokenized your text correctly 
decode_txt=tokenizer.decode(transformed_results[0]["tokenizedText"])

assert decode_txt == transformed_results[0]['reviewText'] 

2024-03-04 11:52:09.645834: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-04 11:52:10.966887: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-03-04 11:52:10.966993: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
