## GENERATING SYNTHETIC PRODUCT DATA

The purpose of this notebook is to generate synthetic data for a fictional clothing retail company. It contains a simple simulation of this company's activity in the international market over about 2 years.

In [2]:
!pip install langchain-google-genai==2.0.9 --break-system-packages

Defaulting to user installation because normal site-packages is not writeable


In [3]:
import pandas as pd
import numpy as np
import random
import sys

sys.path.append('../../libraries')

import utils

Load the distribution of sales by category and season for each country in which the retail company has stores.

In [4]:
distribution_by_cat = utils.load_data('distribution_by_category.csv', '../../data')
sites = distribution_by_cat.country.unique()
distribution_by_cat.sample(5)

Unnamed: 0,country,consumption,category,season
220,USA,0.3,Activewear,Summer
144,Germany,0.25,Dresses,Summer
13,France,0.25,Outerwear,Fall
199,Mexico,0.1,Swimwear,Fall
157,Japan,0.4,Dresses,Fall
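
One way a table like this could be used downstream is to draw the category of each simulated sale with probability proportional to its `consumption` share. The sketch below assumes that reading. Only the USA/Summer/Activewear share of 0.3 comes from the sample above; the other shares and the helper name `sample_category` are hypothetical.

```python
import random

# Shares for one (country, season) pair. Only Activewear = 0.3 appears in the
# sample above; the remaining values are made up for illustration.
usa_summer_shares = {
    "Activewear": 0.3,
    "Swimwear": 0.3,   # hypothetical
    "Dresses": 0.2,    # hypothetical
    "Tops": 0.2,       # hypothetical
}

def sample_category(shares, rng):
    """Draw one category with probability proportional to its share."""
    categories = list(shares)
    weights = [shares[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed so the draws are reproducible
draws = [sample_category(usa_summer_shares, rng) for _ in range(1000)]
print({c: draws.count(c) for c in usa_summer_shares})
```

With enough draws, the observed frequencies should approach the configured shares.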


Do the same for the distribution of sales by country and season, expressed relative to USA sales (USA = 1.0 in every season).

In [5]:
distribution_of_sales = utils.load_data('distribution_of_sales_by_country.csv', '../../data')
distribution_of_sales.sample(5)

Unnamed: 0,country,Winter,Spring,Summer,Fall
0,USA,1.0,1.0,1.0,1.0
3,France,0.85,1.25,0.05,0.85
8,Brazil,0.7,0.65,1.35,0.85
1,Canada,0.7,0.85,0.95,1.25
4,Germany,0.9,0.9,0.8,0.9
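
Read as multipliers on a USA baseline, these values can turn a USA sales figure into an estimate for another country and season. The sketch below assumes that interpretation. The multipliers are copied from the sample above, while the baseline of 1,000 units and the helper `scale_sales` are hypothetical.

```python
# Seasonal multipliers relative to USA sales, copied from the sample above.
multipliers = {
    "USA":    {"Winter": 1.0, "Spring": 1.0,  "Summer": 1.0,  "Fall": 1.0},
    "Canada": {"Winter": 0.7, "Spring": 0.85, "Summer": 0.95, "Fall": 1.25},
    "Brazil": {"Winter": 0.7, "Spring": 0.65, "Summer": 1.35, "Fall": 0.85},
}

def scale_sales(usa_units, country, season):
    """Scale a USA sales figure by the country's seasonal multiplier."""
    return round(usa_units * multipliers[country][season])

usa_baseline = 1000  # hypothetical number of units sold in the USA in one season
print(scale_sales(usa_baseline, "Canada", "Fall"))
print(scale_sales(usa_baseline, "Brazil", "Summer"))
```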


### Definition of records
Define the structure of the records that will be written to a CSV file.

- `product`
  - gtin
  - productCode
  - label
  - size
  - color
  - category
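
As a sketch of this structure, the record could be modeled as a dataclass whose fields mirror the list above and serialized with the standard `csv` module. The class name `Product` is illustrative and not part of the notebook, which builds a pandas DataFrame instead.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class Product:
    gtin: str          # 13-digit identifier, kept as text to preserve leading zeros
    productCode: str
    label: str
    size: str
    color: str
    category: str

# Sample values taken from the seed row used later in the notebook.
product = Product("8762109876543", "ACC-007", "Chain Necklace", "XL", "Gold", "Accessories")

buffer = io.StringIO()  # in-memory stand-in for a CSV file
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Product)])
writer.writeheader()
writer.writerow(asdict(product))
csv_text = buffer.getvalue()
print(csv_text)
```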

In [6]:
from langchain_google_genai import GoogleGenerativeAI
import os
import time
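# Never hard-code the API key in the notebook: it is read from the environment instead
if not os.environ.get("GOOGLE_API_KEY"):
  raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this cell")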



In [7]:
llm = GoogleGenerativeAI(model="gemini-2.0-flash")

#### Generating products

In [8]:
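def format_product(arr):
  # Column order must match the order of the fields in each generated row
  columns = ['gtin', 'productCode', 'size', 'color', 'label', 'category']
  data = {column: [] for column in columns}
  for row in arr:
    # Strip stray whitespace around each comma-separated field
    fields = [field.strip() for field in row.split(',')]
    # Skip blank or malformed rows (wrong number of fields)
    if len(fields) != len(columns):
      continue
    for column, value in zip(columns, fields):
      data[column].append(value)
  return pd.DataFrame(data)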
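
Because the rows come from a language model, a field-count check alone may let through values that break the requested format. Below is a minimal sketch of an extra check, assuming the rules stated in this notebook's generation prompt (a 13-digit numeric `gtin` and a `size` from XS to XL). The helper `is_valid_row` is hypothetical and not part of the notebook, and the rows other than the first are invented test cases.

```python
ALLOWED_SIZES = {"XS", "S", "M", "L", "XL"}

def is_valid_row(row):
    """Return True if a comma-separated row has 6 fields, a 13-digit gtin and a known size."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) != 6:
        return False
    gtin, _product_code, size, _color, _label, _category = fields
    # isascii() guards against non-ASCII digits, which str.isdigit() would accept
    return len(gtin) == 13 and gtin.isascii() and gtin.isdigit() and size in ALLOWED_SIZES

rows = [
    "8762109876543,ACC-007,XL,Gold,Chain Necklace,Accessories",  # well formed
    "12345,ACC-008,XL,Gold,Chain Bracelet,Accessories",          # gtin too short
    "8762109876544,ACC-009,XXL,Gold,Chain Anklet,Accessories",   # size not allowed
    "gtin,productCode,size,color,label,category",                # header echoed by the model
]
valid_rows = [row for row in rows if is_valid_row(row)]
print(valid_rows)
```

This only checks the shape of the values; it does not verify a GTIN check digit, which the prompt does not ask for.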


In [9]:
def generate_products(df, n = 200, batch_size=50):
  for _ in range(-(-n // batch_size)):  # ceil(n / batch_size) batches: 4 requests for n=200, batch_size=50
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
      previous_result = df.to_string()
    response = llm.invoke(f'''
Generate a list of {batch_size} unique clothing products, each represented as a row with the following structure: gtin, productCode, size, color, label, category

* **gtin:** A unique 13-digit numeric identifier.
* **productCode:** A short alphanumeric code (e.g., CLOTH-001).
* **size:** One of the following: XS, S, M, L, XL.
* **color:** A common color name (e.g., Red, Blue, Black).
* **label:** A detailed, descriptive name of the clothing product.
* **category:** The category the product belongs to.

Categorize each product into one of these categories: {distribution_by_cat.category.unique()}.

Consider the geographical context of these countries: {distribution_by_cat.country.unique()}. However, do not include the country name in the product label.

The output should be formatted as a list of rows, with each row representing a product. Separate each product with a newline character (`\n`).

Ensure that the generated products are entirely new and do not overlap with any previously generated products, which are listed below:

{previous_result}

Output only the generated product data, formatted as described.
    ''')
    arr = response.split('\n')

    auxiliar_df = format_product(arr)

    df = pd.concat([df, auxiliar_df], ignore_index=True)
    time.sleep(2) # brief pause between requests; the free tier of the Google API limits requests per minute
  return df
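
The prompt asks the model not to repeat earlier products, but nothing in the loop enforces it. One possible safeguard, sketched below under the assumption that `gtin` is the unique key, is to filter each batch of raw rows against the identifiers already collected before passing it to `format_product`. The helper `drop_seen_gtins` is hypothetical, and the third row in the batch is invented for illustration.

```python
def drop_seen_gtins(rows, seen_gtins):
    """Keep rows whose gtin (first field) is new, and record it in seen_gtins."""
    fresh = []
    for row in rows:
        gtin = row.split(",", 1)[0].strip()
        if not gtin or gtin in seen_gtins:
            continue  # blank line or identifier already generated
        seen_gtins.add(gtin)
        fresh.append(row)
    return fresh

seen = {"8762109876543"}  # gtin of the seed row
batch = [
    "8762109876543,ACC-007,XL,Gold,Chain Necklace,Accessories",  # repeats the seed
    "9780123456918,TRACK-014,L,Black,Track Pants with Zippered Pockets,Activewear",
    "9780123456918,TRACK-015,M,Navy,Track Pants,Activewear",     # repeats the gtin above
    "",                                                          # blank line
]
unique_rows = drop_seen_gtins(batch, seen)
print(unique_rows)
```

Keeping the set of seen identifiers outside the function lets it carry over from one batch to the next.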

In [10]:
products = format_product(['8762109876543,ACC-007,XL,Gold,Chain Necklace,Accessories']) # seed example row with all 6 fields
products = generate_products(products)
products.sample(5)

Unnamed: 0,gtin,productCode,size,color,label,category
13,9780123456918,TRACK-014,L,Black,Track Pants with Zippered Pockets,Activewear
432,9780123461103,BELT-433,XL,Cognac,Braided Leather Belt,Accessories
161,9780123458394,COAT-162,L,Charcoal,Herringbone Wool Coat,Outerwear
425,9780123461035,POLO-426,XL,Sky Blue,Performance Golf Polo,Tops
126,9780123458042,CAPRI-127,S,Black,Cropped Leggings,Bottoms


In [11]:
utils.save_data(products, 'products.csv', '../../data')

Data saved to: /mnt/sda2/ICC/pasantia/final-project/data/products.csv
