Today we'll broaden the coverage of our dataset and curate it into an excellent dataset for training.

In [1]:
# imports

import os
import random
from dotenv import load_dotenv
from huggingface_hub import login
from datasets import load_dataset, Dataset, DatasetDict
from items import Item
from loaders import ItemLoader
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import numpy as np
import pickle


In [2]:
# Load variables from a local .env file first, then default any missing key to an empty string

load_dotenv(override=True)
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', '')
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', '')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', '')
os.environ['GOOGLE_API_KEY'] = os.getenv('GOOGLE_API_KEY', '')
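
Because `os.getenv(name, '')` falls back to an empty string, a missing key does not raise in this cell; the problem would only surface later. A minimal sketch of an early check follows. The helper `missing_keys` and the demo values are hypothetical and not part of the original notebook.

```python
import os

def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Demo with a stand-in mapping instead of the real environment
demo_env = {"HF_TOKEN": "hf_example", "OPENAI_API_KEY": ""}
missing = missing_keys(["HF_TOKEN", "OPENAI_API_KEY", "GOOGLE_API_KEY"], env=demo_env)
print(missing)
```

Calling `missing_keys(["HF_TOKEN"])` without `env` would check the real process environment, which is probably what you want before logging in.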

In [3]:
# Log in to HuggingFace

hf_token = os.environ['HF_TOKEN']
login(hf_token, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (manager).
Your token has been saved to C:\Users\Shrian Singh\.cache\huggingface\token
Login successful


In [4]:
# Load in the same dataset as last time

items = ItemLoader("Appliances").load()

Loading dataset Appliances


100%|██████████| 95/95 [01:56<00:00,  1.23s/it] 


Completed Appliances with 28,625 datapoints in 2.6 mins


Finally
It's time to split our data into a training set and a test set: the first 25,000 items for training and the remainder for testing.

In [5]:
train = items[:25000]
test = items[25000:]
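
The slices above keep the items in whatever order the loader returned them. If that order is not already random, a seeded shuffle before slicing is one option. This is a sketch under that assumption, not what the cell above does; `shuffled_split` is a hypothetical helper and the integers stand in for `Item` objects.

```python
import random

def shuffled_split(items, train_size, seed=42):
    """Shuffle a copy of items with a fixed seed, then slice into (train, test)."""
    shuffled = list(items)                  # copy so the caller's sequence is untouched
    random.Random(seed).shuffle(shuffled)   # private RNG keeps the split reproducible
    return shuffled[:train_size], shuffled[train_size:]

# Stand-in data: integers in place of Item objects
train_demo, test_demo = shuffled_split(range(10), train_size=8)
print(len(train_demo), len(test_demo))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions means other code that draws random numbers cannot change the split.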

In [6]:
print(train[0].prompt)

How much does this cost to the nearest dollar?

Rack Roller and stud assembly Kit (4 Pack) by AMI PARTS Replaces
PARTS NUMBER The dishwasher top rack wheels and stud assembly Kit （4 pcs） SCOPE OF APPLICATION The dishwasher works with most top name brands,If you are not sure if part is correct, ask us in Customer questions & answers section or visiting the AMI PARTS storefront.We’re happy to help ensure you select the correct part for your Rack Roller and stud REPLACES PART FIXES SYMPTOMS Door won’t close | Not cleaning dishes properly | Noisy | Door latch failure QUALITY WARRANTY The replacement part is made from durable high quality material and well-tested by manufacturer.For any reason you’re not satisfied,you can ask for a replacement or full refund Brand Name AMI PARTS, Model

Price is $9.00


In [9]:
print(test[0].test_prompt())

How much does this cost to the nearest dollar?

DPD Washer Lid Lock Latch Switch Assembly Fits for Maytag Centennial Washer Whirlpool Kenmore Washer Replaces
Part washer lid lock switch replaces： This washer lid lock replacement works with the following products Whirlpool, Maytag, Kenmore, Amana. Contact Us If you are not sure if part is correct, ask us in Customer questions & answers section or contact us by visiting the Discount Parts Direct storefront. Package Includes 1 x lid lock switch assembly is a 4-wire switch, 2 x bezels (white and grey), 1 x instructions Part numbers etc. Works For Brands washer lid lock replacement Compatible with Whirlpool, Kenmore, Amana,Maytag centennial washer. PREMIUM QUALITY Lid Lock Latch Switch detects if the washer

Price is $
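
The two printed prompts differ only in their ending: the training prompt closes with the price, while the test prompt stops at `Price is $` so the model has to supply the number. One way to derive the second form from the first is sketched below. It assumes the price always follows the last `Price is $` in the text; `to_test_prompt` is a hypothetical helper, and `Item.test_prompt()` may well be implemented differently.

```python
PRICE_PREFIX = "Price is $"

def to_test_prompt(prompt: str) -> str:
    """Cut a training prompt just after the final 'Price is $' so the answer is hidden."""
    head, separator, _answer = prompt.rpartition(PRICE_PREFIX)
    if not separator:
        raise ValueError("prompt does not contain the price prefix")
    return head + PRICE_PREFIX

example = (
    "How much does this cost to the nearest dollar?\n\n"
    "Rack Roller and stud assembly Kit\n\n"
    "Price is $9.00"
)
masked = to_test_prompt(example)
print(masked)
```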


#### Finally - upload your brand new dataset
#### Convert to prompts and upload to HuggingFace hub

In [10]:
train_prompts = [item.prompt for item in train]
train_prices = [item.price for item in train]
test_prompts = [item.test_prompt() for item in test]
test_prices = [item.price for item in test]

In [13]:
# Create a Dataset from the lists

train_dataset = Dataset.from_dict({"text": train_prompts, "price": train_prices})
test_dataset = Dataset.from_dict({"text": test_prompts, "price": test_prices})
dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})
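
Before uploading, it can be worth confirming that each split has matching column lengths and that no test prompt still contains its answer. The sketch below assumes the prompt convention shown earlier, where test prompts end with `Price is $` and training prompts continue with the price. `check_split` and the demo rows are hypothetical.

```python
PRICE_SUFFIX = "Price is $"

def check_split(texts, prices, answers_included):
    """Validate one split and return its row count.

    Raises ValueError if the columns differ in length or if a prompt's ending
    does not match the expectation (answer present vs. answer hidden).
    """
    if len(texts) != len(prices):
        raise ValueError(f"{len(texts)} texts but {len(prices)} prices")
    for index, text in enumerate(texts):
        ends_open = text.rstrip().endswith(PRICE_SUFFIX)
        if ends_open == answers_included:
            raise ValueError(f"row {index}: unexpected prompt ending")
    return len(texts)

# Stand-in rows in place of the real prompt lists
demo_train_texts = ["Widget\n\nPrice is $9.00"]
demo_test_texts = ["Gadget\n\nPrice is $"]
train_rows = check_split(demo_train_texts, [9.0], answers_included=True)
test_rows = check_split(demo_test_texts, [12.0], answers_included=False)
print(train_rows, test_rows)
```

With the real data, the calls would be `check_split(train_prompts, train_prices, answers_included=True)` and `check_split(test_prompts, test_prices, answers_included=False)`.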

In [14]:
HF_USER = "MLsheenu"
DATASET_NAME = f"{HF_USER}/pricer-data"
dataset.push_to_hub(DATASET_NAME, private=True)

Creating parquet from Arrow format: 100%|██████████| 25/25 [00:00<00:00, 153.64ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:04<00:00,  4.63s/it]
Creating parquet from Arrow format: 100%|██████████| 4/4 [00:00<00:00, 139.14ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.56s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/MLsheenu/pricer-data/commit/1d092504eeac8d26e601a6cebcecf53dafecf817', commit_message='Upload dataset', commit_description='', oid='1d092504eeac8d26e601a6cebcecf53dafecf817', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/MLsheenu/pricer-data', endpoint='https://huggingface.co', repo_type='dataset', repo_id='MLsheenu/pricer-data'), pr_revision=None, pr_num=None)

In [15]:
# One more thing!
# Let's pickle the training and test datasets so we don't have to re-run all this code next time!

with open('train.pkl', 'wb') as file:
    pickle.dump(train, file)

with open('test.pkl', 'wb') as file:
    pickle.dump(test, file)
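
Reading the files back next time is the mirror image of the cell above. The sketch below round-trips a stand-in list through a temporary file; `save_pickle`, `load_pickle`, and the demo data are hypothetical. Note that unpickling the real `train.pkl` requires the `Item` class to be importable from the `items` module, because pickle stores a reference to the class rather than its code.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path in binary pickle format."""
    with open(path, "wb") as file:
        pickle.dump(obj, file)

def load_pickle(path):
    """Read back an object written by save_pickle. Only load files you trust."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Stand-in data: plain dicts in place of Item objects
demo_train = [{"prompt": "Price is $9.00", "price": 9.0}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train_demo.pkl")
    save_pickle(demo_train, demo_path)
    restored = load_pickle(demo_path)
print(restored == demo_train)
```

In a later notebook, `train = load_pickle('train.pkl')` and `test = load_pickle('test.pkl')` would restore the two lists without re-running the loader.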