## The INSPIRe Framework: A Reminder

**INSPIRe** is a framework designed to help you outsource code generation so you can move faster, learn more, and tackle more projects.

It helps your model generate functional, optimized, and robust code snippets. INSPIRe was derived from trial and error rather than from theory: it grew out of real-life tricks, with reliability and efficiency at its core.

I introduced the framework in the first article, but here's a concise reminder. INSPIRe is an iterative process comprising six steps:

- **Identify (I)**: Determine what you want to do and what you need to do it. Begin with a clear first prompt.
- **Narrate (N)**: Convert your instructions into code through detailed prompts.
- **Screen (S)**: Review each code snippet and correct the errors. Test and adjust.
- **Polish (P)**: Refine your code to improve it. Iterate as much as you need.
- **Integrate (I)**: Assemble your code snippets into a cohesive and elegant program.
- **Restart (Re)**: Start a new loop when you're done or when you hit a wall.

The goal is to break down your coding task into manageable snippets, generating and refining one at a time. Through the INSPIRe framework, you may find yourself iterating multiple times within a single step to achieve the desired quality and functionality.

#### Code Generation and Iterations

To create the code snippets below, I went through more than 15 INSPIRe iterations. We won't detail each one. Instead, we'll focus on the key steps and show where we made major improvements. The goal is to understand the logic of INSPIRe in a real-life situation.

#### Importing packages
For simplicity, we'll import all the packages used across the different snippets in one place. Keep this import cell as a reference you can come back to whenever you have questions about what each package does.


# Import relevant packages and display their versions

In [1]:
import json
import random
import math
import sys
from io import StringIO

import pandas as pd # pandas==1.5.3 
import octoai # octoai==0.8.3
from octoai.client import Client

In the previous cell, we are setting up our environment by importing the necessary packages for our synthetic data generator. Here's what each import does:

- `json`: A built-in Python library for parsing and manipulating JSON data.
- `random`: Provides functions that support random number generation.
- `math`: Offers access to mathematical functions like square root, factorial, etc.
- `sys`: Used for interacting with the Python runtime environment.
- `StringIO` from `io`: Allows for reading and writing strings as files, which is useful for handling in-memory text streams.
- `pandas`: A powerful data manipulation and analysis library, here imported as `pd` for convenience.
- `octoai`: The OctoAI Python SDK, a third-party library we use here to call a hosted model (Mixtral-8x7B) that generates the synthetic text.
- `Client` from `octoai.client`: The class that takes your API token and sends requests to the OctoAI API.

Each of these imports is crucial for the functionality of our synthetic data generator, providing various tools and functionalities required throughout the process. <br>
<br>
When you generate code with a Large Language Model (LLM), the model suggests the necessary packages based on the code's context and functionality. In this Jupyter Notebook, we've listed the packages explicitly at the beginning to provide clarity on the dependencies required for the synthetic data generator. This way, you can easily refer back to this cell to recall the purpose and functionality of each package used in the project.


In [2]:
# Function to print the version of each module you import (you never know!)

def get_version(module, default="Unknown"):
    return getattr(module, '__version__', default)

# For standard library modules, the version is essentially the Python version
standard_libs = {
    'json': json,
    'random': random,
    'math': math,
    'io.StringIO': StringIO  # Demonstrating how you might handle submodules or classes
}

# External libraries
external_libs = {
    'octoai': octoai,
    'pandas': pd,
}

print("Python version:", sys.version)

for name, module in standard_libs.items():
    print(f"{name} version: Python Standard Library (Python {sys.version_info.major}.{sys.version_info.minor})")

for name, module in external_libs.items():
    print(f"{name} version:", get_version(module))

Python version: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
json version: Python Standard Library (Python 3.11)
random version: Python Standard Library (Python 3.11)
math version: Python Standard Library (Python 3.11)
io.StringIO version: Python Standard Library (Python 3.11)
octoai version: 0.8.3
pandas version: 1.5.3


The previous cell contains a function and code to print the versions of all the imported modules. This can be particularly useful for debugging or ensuring compatibility. Here's what happens step by step:

- The `get_version` function is defined to fetch the version of a given module, returning "Unknown" if the version attribute is not found.
- `standard_libs` dictionary: Maps the name of standard library modules to their imported objects. It’s used here to showcase how to handle versioning in Python standard libraries.
- `external_libs` dictionary: Similar to `standard_libs`, but for external libraries like `octoai` and `pandas`.
- The Python version is printed directly using `sys.version`.
- For each module in `standard_libs`, the loop prints the module name along with a string indicating that it is part of the Python Standard Library and therefore shares the Python version.
- Similarly, for each module in `external_libs`, it prints the module name and its version using the `get_version` function.

This approach ensures transparency about the environment in which the synthetic data generator is running, potentially aiding in understanding behavior or resolving issues related to version mismatches.
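
Some packages do not expose a `__version__` attribute, in which case `get_version` falls back to "Unknown". As a complement, here is a minimal sketch that asks the installed package metadata instead, assuming Python 3.8 or later where `importlib.metadata` is part of the standard library. The helper name `installed_version` is hypothetical and not part of the original notebook, and the distribution name you pass may differ from the name you import.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name, default="Unknown"):
    """Return the installed version of a distribution, or `default` if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        # The distribution is not installed under this name
        return default


# "pandas" is both the import name and the distribution name;
# other packages may be published under a different distribution name.
print("pandas version:", installed_version("pandas"))
```

If a lookup returns "Unknown" for a package you know is installed, check the name it is published under.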


# Let's write the first prompt, the most important one

### Example of a Templated Prompt during the IDENTIFY Step

#### Role
- **Act like an expert software engineer who specializes in `<programming_language>`.**
- Your role is to help me achieve the following objective: `<objective>`.

#### Guidelines
- Write elegant and functional code in `<programming_language>`.
- Reason step by step to make sure you understand the user intentions before you turn them into elegant code.
- Make sure the code you write and/or edit is clear and well-commented.
- Write code that adheres to the best practices in `<programming_language>`.
- When you first respond, acknowledge the instructions you've been given then ask the user to describe the intentions they want to turn into code.

#### Specifics
- When the user indicates specific syntax and functions, make sure to remember the provided syntax and use it for the code you generate.
- Assume the user provides the correct syntax but always verify indentations, symbols, and punctuation.
- `<more_specifics>` 

#### Format
- Give a clear title to each code snippet you generate.
  For example, you can title the first code snippet "Snippet #1 version 1.0".
- If you edit "Snippet #1 version 1.0" then the output should be called "Snippet #1 version 1.1" and so on.
- If you move to a new function or piece of code, you should name it "Snippet #2 version 1.0".
- When you interact with the user in natural language, use line breaks, titles, and elegant formatting to ensure a pleasant reading experience.

#### Inputs
- `<objective>` = Generate synthetic text data using the OctoAI API and the Mixtral-8x7B model.
- `<programming_language>` = Python 3.11.5
- `<more_specifics>` = Here's the exact syntax you need to use to call the OctoAI API and use the Mixtral-8x7B model:
  ```python
  import json
  import octoai
  from octoai.client import Client

  client = Client(Token)
  completion = client.chat.completions.create(
    messages=[
          {
              "role": "system",
              "content": "You're a helpful assistant!"
          },
          {
              "role": "user",
              "content": "What's the best way to enjoy a cold coffee brew?"
          }
      ],
      model="mixtral-8x7b-instruct-fp16",
      max_tokens=1000,
      presence_penalty=0,
      temperature=0.1,
      top_p=0.9,
  )

  output = json.dumps(completion.dict(), indent=2)
  print(output)
  ```


# Import your token for OctoAI

In [3]:
file_path = "your_file_path_here.txt"
with open(file_path, 'r') as file:
    Token = file.read().strip()  # strip() drops the trailing newline that would otherwise end up in the token
# You can also copy-paste your OctoAI token into a variable called "Token".
# It would look something like this: Token = "paste your OctoAI API token here"
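
If you would rather not keep the token in a file next to the notebook, an environment variable is a common alternative. The sketch below is only an illustration: the helper `load_token` and the variable name `OCTOAI_TOKEN` are assumptions made for this example, not names taken from the original notebook.

```python
import os
from pathlib import Path


def load_token(env_var="OCTOAI_TOKEN", file_path=None):
    """Return the API token from an environment variable, else from a text file."""
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    if file_path is not None:
        # strip() removes the trailing newline most editors add to text files
        return Path(file_path).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"No token found: set {env_var} or pass file_path")


# Example usage (hypothetical path):
# Token = load_token(file_path="your_file_path_here.txt")
```

The environment variable wins when both sources exist, so the same notebook can run on different machines without edits.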

# Copy-paste the OctoAI syntax for later usage

In [None]:
# The following snippet demonstrates how to interact with the OctoAI API using the Client class.
# It shows the setup for sending a conversation with specific roles and content to the API,
# and how to receive a response from a designated model.

import octoai
from octoai.client import Client

client = Client(Token)
completion = client.chat.completions.create(
  messages=[
        {
            "role": "system",
            "content": "You're a helpful assistant!"
        },
        {
            "role": "user",
            "content": "What's the best way to enjoy a cold coffee brew?"
        }
    ],
    model="mixtral-8x7b-instruct-fp16",
    max_tokens=1000,
    presence_penalty=0,
    temperature=0.1,
    top_p=0.9,
)

output = json.dumps(completion.dict(), indent=2)
print(output)

In the previous code snippet, we illustrate the process of interfacing with the OctoAI API to make a request and retrieve a response using a specific model. Here's how the code works:

1. We import the `octoai` package and the `Client` class from `octoai.client` to use the API client functionality.
2. A `Client` object is instantiated with an authentication token, which establishes a session with the OctoAI API.
3. The `chat.completions.create` method is called on the `client` object to send a request. This request includes:
   - A list of `messages`, each with a `role` and `content`, representing the conversation context sent to the model.
   - The `model` parameter specifies the model identifier to be used for the completion.
   - `max_tokens` caps the length of the completion, `temperature` and `top_p` control how random the sampling is, and `presence_penalty` discourages the model from reusing tokens that already appear in the text.
4. The response from the API is then converted to a JSON string using `json.dumps`, with formatting for readability.
5. Finally, the output is printed, showing the model-generated response in a structured JSON format.

This code serves as a basic example of using the OctoAI API to initiate a dialog with the model and obtain a completion, illustrating how to programmatically interact with AI models for text generation.
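
Since the same call is repeated later with different prompts, it can help to assemble the request in one place. This is a minimal sketch, not part of the original notebook: the helper `build_request` is hypothetical, and it only builds the keyword arguments shown above without contacting the API.

```python
def build_request(system_prompt, user_prompt, max_tokens=1000, temperature=0.1, top_p=0.9):
    """Assemble the keyword arguments used in the chat completion call above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "model": "mixtral-8x7b-instruct-fp16",
        "max_tokens": max_tokens,
        "presence_penalty": 0,
        "temperature": temperature,
        "top_p": top_p,
    }


request = build_request(
    "You're a helpful assistant!",
    "What's the best way to enjoy a cold coffee brew?",
)
# With a real client you would then call:
# completion = client.chat.completions.create(**request)
print(request["messages"][1]["content"])
```

Keeping the parameters in a dictionary also makes it easy to log exactly what was sent for each iteration.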


# Prepare your system and user prompts (basic/static prompts)

In [4]:
# Example of a system prompt

system_prompt= """

Act as an expert Data Scientist who specializes in creating synthetic data.

* Objective: Create a table of fictional audience members reviewing various computer hardware products.
The number of audience members is specified by the user.

* Table Columns:
1. Name: First and last name of the audience member. Ensure ethnic diversity in the names to reflect a global audience.
2. Email: Should follow the format `[firstname].[lastname]@fakemail.com`.
3. Product Name: Choose from VR headsets, Smartphones, Headsets, Mice, Keyboards, Loudspeakers, or Screens.
4. Product Category: Create a unique name for a product in the chosen category.
5. Comment: Provide a constructive comment about the product. Include aspects such as design, functionality, or user experience.

* Instructions:
- Audience Member Details:
  - Names should be diverse and creative, reflecting ethnic diversity to represent a global audience.
  - Email addresses must use the specified format and be unique for each audience member.
- Product and Comment:
  - Each comment should be relevant to the product category and offer insights into the product's features, benefits, or areas for improvement.
  - Fictional product names should be imaginative and reflect the product category.

* Example Entry:
- Name: Iulia Patel
- Email Address: Iulia.patel@fakemail.com
- Product Category: VR headset
- Product Name: VisionQuest 360 
- Detailed Comment: "The VisionQuest 360 offers an immersive experience with its high-resolution display and comfortable fit. However, improving the battery life could make it even more appealing for extended gaming sessions."
"""

# Example of a user prompt
user_prompt = "Create a table with 5 rows based on your instructions. Make the data into a CSV format with ';' as separator. Refrain from adding any further text. Only output the data please."

# Make a test call to generate the first outputs

In [5]:
from octoai.client import Client

client = Client(Token)
completion = client.chat.completions.create(
  messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt
        }
    ],
    model="mixtral-8x7b-instruct-fp16",
    max_tokens=1000,
    presence_penalty=0,
    temperature=0.1,
    top_p=0.9,
)


output = json.dumps(completion.dict(), indent=2)
print(output)

{
  "id": "chatcmpl-a6f1b5d27bfa4f699ed91a41b8dc5cbf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " \"Name\" ; \"Email Address\" ; \"Product Category\" ; \"Fictional Product Name\" ; \"Comment\"\n\"Hiroshi Yamamoto\" ; \"Hiroshi.yamamoto@fakemail.com\" ; \"Smartphones\" ; \"GalaxyS10 Zen\" ; \"The GalaxyS10 Zen has a sleek design and impressive camera, but the battery life could be better.\"\n\"Clara Oluwasegun\" ; \"Clara.oluwasegun@fakemail.com\" ; \"Headsets\" ; \"SoundSerenity Pro\" ; \"The SoundSerenity Pro headset delivers clear audio and a comfortable fit, making it perfect for long listening sessions.\"\n\"Lukas Schmidt\" ; \"Lukas.schmidt@fakemail.com\" ; \"Mice\" ; \"PrecisionPointer X1\" ; \"The PrecisionPointer X1 mouse has excellent precision and customizable buttons, but the scroll wheel feels a bit stiff.\"\n\"Mei Chen\" ; \"Mei.chen@fakemail.com\" ; \"Loudspeakers\" ; \"SonicBoost Max\" ; \"The SonicBoost Ma

# Parse the JSON output from OctoAI

### Example of a short prompt used during the "NARRATE" step
#### Parsing the output JSON object

After running the code, I obtained a JSON object. Now, I need to generate a code snippet that can parse this JSON object to extract the "content" field.

**Instruction:**
Please create a code snippet that:
- Parses the JSON object.
- Retrieves the "content" from the parsed object.
- Includes necessary packages.
- Contains clear comments explaining the code.


In [6]:
# Parse the JSON
data = json.loads(output)

# Extract table content from the 'content' field
table_content = data['choices'][0]['message']['content']
print(table_content)

 "Name" ; "Email Address" ; "Product Category" ; "Fictional Product Name" ; "Comment"
"Hiroshi Yamamoto" ; "Hiroshi.yamamoto@fakemail.com" ; "Smartphones" ; "GalaxyS10 Zen" ; "The GalaxyS10 Zen has a sleek design and impressive camera, but the battery life could be better."
"Clara Oluwasegun" ; "Clara.oluwasegun@fakemail.com" ; "Headsets" ; "SoundSerenity Pro" ; "The SoundSerenity Pro headset delivers clear audio and a comfortable fit, making it perfect for long listening sessions."
"Lukas Schmidt" ; "Lukas.schmidt@fakemail.com" ; "Mice" ; "PrecisionPointer X1" ; "The PrecisionPointer X1 mouse has excellent precision and customizable buttons, but the scroll wheel feels a bit stiff."
"Mei Chen" ; "Mei.chen@fakemail.com" ; "Loudspeakers" ; "SonicBoost Max" ; "The SonicBoost Max loudspeakers deliver powerful sound and a stylish design, but the bass could be more pronounced."
"Sofia Rodriguez" ; "Sofia.rodriguez@fakemail.com" ; "Screens" ; "PixelPerfect Vision X2" ; "The PixelPerfect Visio

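The direct lookup `data['choices'][0]['message']['content']` raises a bare `KeyError` or `IndexError` when the response has an unexpected shape. A slightly more defensive variant could look like the sketch below. The helper `extract_content` is hypothetical, and the sample JSON only mimics the fields visible in the output above.

```python
import json


def extract_content(raw_json):
    """Return the text of the first choice, with clear errors when it is missing."""
    data = json.loads(raw_json)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("First choice has no message content")
    # Remove the leading and trailing whitespace the model sometimes adds
    return content.strip()


# Made-up sample with the same structure as the response printed earlier
sample = json.dumps(
    {"choices": [{"index": 0, "message": {"role": "assistant", "content": " Name;Email\nA;B"}}]}
)
print(extract_content(sample))
```
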
# Transform the parsed data into a dataframe

### Example of a short prompt during the "Narrate" step
#### Converting the "content" into a DataFrame

After running the code, I received a string output. Now, I need to generate a code snippet that will:
- Convert the parsed data from the "content" string into a DataFrame.
- Name the DataFrame as `df`.

**Request:**
Please include in the code snippet:
- Necessary packages for the operation.
- Clear comments explaining each part of the code.


In [7]:
from io import StringIO
# Use StringIO to simulate reading from a file
parsed_data = StringIO(table_content)

# Read the data into a pandas DataFrame using ";" as the column separator
df = pd.read_csv(parsed_data, sep=';')

# Display the DataFrame
df

| | "Name" | "Email Address" | "Product Category" | "Fictional Product Name" | "Comment" |
|---|---|---|---|---|---|
| 0 | Hiroshi Yamamoto | "Hiroshi.yamamoto@fakemail.com" | "Smartphones" | "GalaxyS10 Zen" | "The GalaxyS10 Zen has a sleek design and imp... |
| 1 | Clara Oluwasegun | "Clara.oluwasegun@fakemail.com" | "Headsets" | "SoundSerenity Pro" | "The SoundSerenity Pro headset delivers clear... |
| 2 | Lukas Schmidt | "Lukas.schmidt@fakemail.com" | "Mice" | "PrecisionPointer X1" | "The PrecisionPointer X1 mouse has excellent ... |
| 3 | Mei Chen | "Mei.chen@fakemail.com" | "Loudspeakers" | "SonicBoost Max" | "The SonicBoost Max loudspeakers deliver powe... |
| 4 | Sofia Rodriguez | "Sofia.rodriguez@fakemail.com" | "Screens" | "PixelPerfect Vision X2" | "The PixelPerfect Vision X2 screen offers stu... |
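
The quotes that survive in the column names and values come from the spaces the model placed around each `;`. One way to clean this up before building the DataFrame is sketched below with the standard `csv` module. The helper `parse_table` is hypothetical, and it makes two assumptions: cells never need their surrounding spaces, and any row whose field count differs from the header can be skipped.

```python
import csv
from io import StringIO


def parse_table(content, sep=";"):
    """Split separator-delimited text into (header, rows), trimming spaces and quotes."""
    reader = csv.reader(StringIO(content.strip()), delimiter=sep, skipinitialspace=True)
    # Trim the whitespace the model left around each separator
    cleaned = [[cell.strip() for cell in row] for row in reader if row]
    header, rows = cleaned[0], cleaned[1:]
    # Keep only rows with the same number of fields as the header
    rows = [row for row in rows if len(row) == len(header)]
    return header, rows


# Shortened sample in the same style as the model output, with a cut-off last row
sample = (
    ' "Name" ; "Email Address" ; "Product Category"\n'
    '"Hiroshi Yamamoto" ; "Hiroshi.yamamoto@fakemail.com" ; "Smartphones"\n'
    '"Sofia Rodriguez" ; "Sofia.rodrig'
)
header, rows = parse_table(sample)
print(header)
print(rows)
```

From there, `pd.DataFrame(rows, columns=header)` gives a DataFrame with clean column names. A row that is cut off in the middle of its last field still has the right number of fields, so it would pass this filter.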


# Save your data into an Excel file (or CSV file)

In [9]:
file_path = "your_file_path_here.xlsx"

# Save the DataFrame to an Excel file
df.to_excel(file_path, index=True)

print(f'DataFrame saved to {file_path}')

DataFrame saved to your_file_path_here.xlsx


# Correct and improve your prompts and code

### Example of Improvements Done During the "Screen" Step

During the verification of the code and the generated data, I identified several issues that needed addressing:

1. **Unstable Output Format**:
   - **Problem**: The output's separator was inconsistent, alternating between commas and semi-colons.
   - **Solution**: I updated the prompt to specify a clear format instruction, ensuring uniformity in the output.
<br><br>
2. **Incorrect Feature in System Prompt**:
   - **Problem**: The system prompt mistakenly included a "Product Name" feature instead of a "Rating" feature.
   - **Solution**: I corrected this by replacing "Product Name" with "Rating" in the prompt to align with the intended data structure.
<br><br>
3. **Insufficient Token Limit**:
   - **Problem**: The original `max_tokens` setting was too low, failing to generate 5 rows of data.
   - **Solution**: I increased the `max_tokens` value from 500 to 1,000 to provide a comfortable margin and ensure complete data generation.
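
Checks like these can be partly automated during the Screen step. The sketch below is one possible approach, not the method used in the original notebook: the helper `find_problems` is hypothetical, the expected header is the one requested in the improved prompt, and the naive split on `;` would wrongly flag a comment that itself contains a semicolon.

```python
EXPECTED_HEADER = ["Name", "Email Address", "Product Reviewed", "Product Rating", "Comment"]


def find_problems(content, sep=";"):
    """Return a list of human-readable problems found in the generated table."""
    lines = [line.strip() for line in content.strip().splitlines() if line.strip()]
    if not lines:
        return ["empty output"]
    problems = []
    header = [cell.strip() for cell in lines[0].split(sep)]
    if header != EXPECTED_HEADER:
        problems.append(f"unexpected header: {header}")
    for number, line in enumerate(lines[1:], start=1):
        cells = [cell.strip() for cell in line.split(sep)]
        if len(cells) != len(EXPECTED_HEADER):
            # Typical causes: wrong separator or a row cut off by max_tokens
            problems.append(f"row {number}: expected {len(EXPECTED_HEADER)} fields, got {len(cells)}")
            continue
        if cells[3] not in {"1", "2", "3", "4", "5"}:
            problems.append(f"row {number}: rating {cells[3]!r} is not an integer from 1 to 5")
    return problems


good = (
    "Name;Email Address;Product Reviewed;Product Rating;Comment\n"
    "Iulia Patel;Iulia.patel@fakemail.com;VisionQuest 360;4;Immersive display."
)
print(find_problems(good))
```

An empty list means the batch passed these checks; anything else is a hint about what to adjust in the prompt or the call parameters.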



In [None]:
# Improvements made inside the SYSTEM PROMPT
# Notice the "Format" section

system_prompt= """

Act as an expert Data Scientist who specializes in creating synthetic data.

*Objective: Create a table of fictional audience members reviewing various computer hardware products.
The number of audience members is specified by the user.

*Table Columns:
1. Name: First and last name of the audience member. Ensure ethnic diversity in the names to reflect a global audience.
2. Email Address: Should follow the format `[firstname].[lastname]@fakemail.com`.
3. Product Reviewed: Choose from VR headsets, Smartphones, Headsets, Mice, Keyboards, Loudspeakers, or Screens.
4. Product Rating: Assign a rating from 1 to 5 stars.
5. Comment: Provide a constructive comment about the product. Include aspects such as design, functionality, or user experience.

*Instructions:
- Audience Member Details:
  - Names should be diverse and creative, reflecting ethnic diversity to represent a global audience.
  - Email addresses must use the specified format and be unique for each audience member.
- Product and Comment:
  - Each comment should be relevant to the product category and offer insights into the product's features, benefits, or areas for improvement.
  - Fictional product names should be imaginative and reflect the product category.

*Format:
- The content should be structured as a table with semicolon-separated values (CSV format). Make sure to start with "Name;Email Address;Product Reviewed;Product Rating;Comment" and then continue.
- Each entry must contain the following columns in order: Name; Email Address; Product Reviewed; Product Rating; Comment.
- Product ratings should be expressed as integers from 1 to 5, where 1 represents the least satisfaction and 5 represents the highest satisfaction.

*Example Entry:
- Name: Iulia Patel
- Email Address: Iulia.patel@fakemail.com
- Product Reviewed: VisionQuest 360
- Product Rating: 4
- Comment: "The VisionQuest 360 offers an immersive experience with its high-resolution display and comfortable fit. However, improving the battery life could make it even more appealing for extended gaming sessions."
"""

user_prompt = "Create a table with 5 rows based on your instructions. Make the data into a CSV format with ';' as separator. Refrain from adding any further text. Only output the data please."

### Implement the improvement for the code

In [None]:
#Improvements made inside the code
#Notice the max_tokens value

from octoai.client import Client
client = Client(Token)

completion = client.chat.completions.create(
  messages=[
        {
            "role": "system",
            "content": system_prompt #system_prompt is now a variable that contains the elaborated prompt you just wrote.
        },
        {
            "role": "user",
            "content": user_prompt #Same story with user_prompt.
        }
    ],
    model="mixtral-8x7b-instruct-fp16",
    max_tokens=1000, # changed from 500 to 1,000
    presence_penalty=0,
    temperature=0.1,
    top_p=0.9,
)
output = json.dumps(completion.dict(), indent=2)
print(output)

# Here we'll get a bit creative to improve our system prompt during the "Polish" step.

First, we'll create a JSON object that stores 50 fictional products (generated using an LLM). <br>
Then we'll use the JSON as a pool of options: we dynamically pick a sample from it and insert that sample into the system prompt. <br>
The idea is to make your system prompt about specific products (that already exist in your JSON) instead of generating them from scratch every time. <br>
We'll also introduce a `batch_size` variable that sets how many rows the model should output on each call. I usually use batches of 5. <br>
You'll understand once you see the improved system prompt.
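
To make the batching idea concrete, here is a small sketch of how a total number of rows could be split into calls of at most `batch_size` rows each. The helper `plan_batches` is hypothetical and the notebook may organize its batches differently; `math.ceil` is used only to count the calls.

```python
import math


def plan_batches(total_rows, batch_size=5):
    """Return the number of rows to request in each call."""
    if total_rows <= 0 or batch_size <= 0:
        return []
    n_calls = math.ceil(total_rows / batch_size)
    sizes = [batch_size] * n_calls
    # The last call only asks for the rows that are still missing
    sizes[-1] = total_rows - batch_size * (n_calls - 1)
    return sizes


print(plan_batches(12, batch_size=5))  # [5, 5, 2]
```

Small batches keep each response well under the `max_tokens` limit, which lowers the risk of a row being cut off.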

## Create a JSON file with 50 fictional products from 10 different fictional brands

In [10]:
import json

# This is where your product dictionaries will be defined.
# The list was generated using an LLM and a few prompts to format the data
products = [
    {"brand": "NebulaTech", "product": "Nebula VR Headset Pro", "description": "A high-end VR headset for immersive experiences."},
    {"brand": "NebulaTech", "product": "Nebula Phone X1", "description": "A flagship smartphone with cutting-edge features."},
    {"brand": "NebulaTech", "product": "Nebula Gaming Headset", "description": "A headset designed for gamers with superior sound quality."},
    {"brand": "NebulaTech", "product": "Nebula Precision Mouse", "description": "A mouse engineered for accuracy and performance."},
    {"brand": "NebulaTech", "product": "Nebula Ergo Keyboard", "description": "An ergonomic keyboard for comfort and productivity."},
    {"brand": "AstraGear", "product": "Astra VR Glasses 2.0", "description": "Sleek and comfortable VR glasses for extended wear."},
    {"brand": "AstraGear", "product": "Astra Phone D2", "description": "A durable and reliable smartphone for everyday use."},
    {"brand": "AstraGear", "product": "Astra Sound Headset", "description": "A versatile headset offering crystal-clear audio."},
    {"brand": "AstraGear", "product": "Astra Glide Mouse", "description": "A smooth and responsive mouse for both gaming and professional use."},
    {"brand": "AstraGear", "product": "Astra Slim Keyboard", "description": "A sleek and portable keyboard for users on the go."},
    {"brand": "OrionWares", "product": "Orion VR Ultimate", "description": "The ultimate VR experience with unparalleled visuals and tracking."},
    {"brand": "OrionWares", "product": "Orion Phone Z", "description": "A stylish smartphone with a focus on performance and camera quality."},
    {"brand": "OrionWares", "product": "Orion Surround Headset", "description": "A headset offering immersive audio with surround sound."},
    {"brand": "OrionWares", "product": "Orion Speed Mouse", "description": "A fast and lightweight mouse for competitive gaming."},
    {"brand": "OrionWares", "product": "Orion Mecha Keyboard", "description": "A durable mechanical keyboard with tactile feedback."},
    {"brand": "ZephyrTech", "product": "Zephyr VR Lite", "description": "A lightweight and affordable VR headset for beginners."},
    {"brand": "ZephyrTech", "product": "Zephyr Phone Ultra", "description": "An ultra-thin smartphone with premium features."},
    {"brand": "ZephyrTech", "product": "Zephyr NC Headset", "description": "A noise-cancelling headset for clear audio in any environment."},
    {"brand": "ZephyrTech", "product": "Zephyr Gamer Mouse", "description": "A mouse designed for hardcore gamers with customizable buttons."},
    {"brand": "ZephyrTech", "product": "Zephyr Compact Keyboard", "description": "A space-saving keyboard with full functionality."},
    {"brand": "VortexElectronics", "product": "Vortex VR Premium", "description": "A premium VR headset with exclusive content access."},
    {"brand": "VortexElectronics", "product": "Vortex Phone Max", "description": "A large-screen smartphone for multimedia consumption."},
    {"brand": "VortexElectronics", "product": "Vortex Studio Headset", "description": "A professional-grade headset for music and gaming."},
    {"brand": "VortexElectronics", "product": "Vortex Gaming Mouse", "description": "A high-precision mouse tailored for gamers."},
    {"brand": "VortexElectronics", "product": "Vortex Ergo Keyboard", "description": "An ergonomic keyboard designed for long gaming sessions."},
    {"brand": "QuantumDevices", "product": "Quantum VR Eye", "description": "A VR headset with eye-tracking technology for interactive experiences."},
    {"brand": "QuantumDevices", "product": "Quantum Phone AI", "description": "A smartphone with advanced AI capabilities for smart assistance."},
    {"brand": "QuantumDevices", "product": "Quantum BT Headset", "description": "A wireless Bluetooth headset for freedom of movement."},
    {"brand": "QuantumDevices", "product": "Quantum Vertical Mouse", "description": "An ergonomically designed mouse to reduce wrist strain."},
    {"brand": "QuantumDevices", "product": "Quantum Backlit Keyboard", "description": "A keyboard with backlit keys for late-night sessions."},
    {"brand": "EchoInnovations", "product": "Echo VR Pro", "description": "A professional-grade VR headset for developers and enthusiasts."},
    {"brand": "EchoInnovations", "product": "Echo Secure Phone", "description": "A smartphone with enhanced security features for privacy-conscious users."},
    {"brand": "EchoInnovations", "product": "Echo Dynamic Headset", "description": "A versatile headset with dynamic sound adjustment."},
    {"brand": "EchoInnovations", "product": "Echo Clicky Mouse", "description": "A mouse with satisfying click feedback for productivity."},
    {"brand": "EchoInnovations", "product": "Echo Wireless Keyboard", "description": "A convenient wireless keyboard for a clutter-free desk."},
    {"brand": "LuminTech", "product": "Lumin VR 4K", "description": "A 4K VR headset for lifelike visuals."},
    {"brand": "LuminTech", "product": "Lumin Eco Phone", "description": "An environmentally friendly smartphone with sustainable materials."},
    {"brand": "LuminTech", "product": "Lumin Waterproof Headset", "description": "A durable headset that's perfect for outdoor adventures."},
    {"brand": "LuminTech", "product": "Lumin Flashy Mouse", "description": "A stylish mouse with customizable LED lighting."},
    {"brand": "LuminTech", "product": "Lumin Comfy Keyboard", "description": "A keyboard designed for comfort with soft-touch keys."},
    {"brand": "TerraTech", "product": "Terra VR Outdoor", "description": "A rugged VR headset designed for outdoor use."},
    {"brand": "TerraTech", "product": "Terra Phone S", "description": "A smartphone with superior durability and outdoor functionality."},
    {"brand": "TerraTech", "product": "Terra Ambient Headset", "description": "A headset with ambient sound for safe outdoor listening."},
    {"brand": "TerraTech", "product": "Terra Precision Mouse", "description": "A mouse designed for precision in design and gaming."},
    {"brand": "TerraTech", "product": "Terra Weatherproof Keyboard", "description": "A durable keyboard resistant to water and dust."},
    {"brand": "SkylineElectronics", "product": "Skyline VR Panorama", "description": "A VR headset offering panoramic views for immersive experiences."},
    {"brand": "SkylineElectronics", "product": "Skyline Alpha Phone", "description": "A high-end smartphone with alpha-grade performance."},
    {"brand": "SkylineElectronics", "product": "Skyline Bass Headset", "description": "A headset with deep bass for music lovers."},
    {"brand": "SkylineElectronics", "product": "Skyline Gaming Mouse", "description": "A gaming mouse with high DPI settings and ergonomic design."},
    {"brand": "SkylineElectronics", "product": "Skyline Dev Keyboard", "description": "A keyboard with programmable keys for developers."}
]


# Convert the product list to a JSON string, pretty-printed for readability
products_json = json.dumps(products, indent=4)

# Specify the file path where you want to save the JSON file
file_path = "Your_file_path_here."  # Update this path as needed

# Write the JSON string to the file at the specified path
with open(file_path, 'w') as file:
    file.write(products_json)
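
Because the product list was itself generated by an LLM, a quick sanity check before sampling from it can catch missing keys or an uneven catalog. The sketch below is an illustration only: the helper `summarize_catalog` is hypothetical, and the sample reuses three entries from the list above.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"brand", "product", "description"}


def summarize_catalog(products):
    """Check that every entry has the required keys and count products per brand."""
    for position, item in enumerate(products):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entry {position} is missing keys: {sorted(missing)}")
    return Counter(item["brand"] for item in products)


# Round-trip through JSON to mimic loading the saved file
sample = json.loads(json.dumps([
    {"brand": "NebulaTech", "product": "Nebula VR Headset Pro", "description": "A high-end VR headset for immersive experiences."},
    {"brand": "NebulaTech", "product": "Nebula Phone X1", "description": "A flagship smartphone with cutting-edge features."},
    {"brand": "AstraGear", "product": "Astra VR Glasses 2.0", "description": "Sleek and comfortable VR glasses for extended wear."},
]))
print(summarize_catalog(sample))
```

Run on the full list, the counter should show 10 brands with 5 products each.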

# Include the "JSON samples" in a function. The function returns dynamic system prompts
Notice the **Selected Products** section. <br>
Notice the `batch_size` variable. <br>
We'll also add an **Example of Final Output** section to make sure the model outputs a coherent CSV format each time we run it.

In [11]:
import json
import random

def generate_system_prompt(json_file_path, batch_size):
    
    # Load the product data from the specified JSON file
    with open(json_file_path, 'r') as file:
        products_data = json.load(file)

    # Select a random set of products from the list, based on the specified batch size
    selected_products = random.sample(products_data, min(batch_size, len(products_data)))

    # Generate the dynamic system prompt
    system_prompt = f"""Act as an expert Data Scientist who specializes in creating synthetic data.

**Objective:** Create a table of fictional audience members reviewing the following computer hardware products taken from our product catalog. Each review should reflect the product's features and potential user experiences.

**Selected Products:**
""" + "\n".join([f"**Brand:** {product['brand']}, **Product:** {product['product']}, **Description:** {product['description']}" for product in selected_products]) + f"""

**Table Columns:**
1. **Name:** First and last name of the audience member. Ensure ethnic diversity in the names to reflect a global audience.
2. **Email Address:** Should follow the format `[firstname].[lastname]@fakemail.com`.
3. **Product Reviewed:** One of the selected products.
4. **Product Rating:** Assign a rating from 1 to 5 stars.
5. **Comment:** Provide a constructive comment about the product. Include aspects such as design, functionality, user experience, or how the product's features meet the user's needs.

**Instructions:**
- **Audience Member Details:**
  - Names should be diverse and creative, reflecting ethnic diversity to represent a global audience.
  - Email addresses must use the specified format and be unique for each audience member.
- **Product Review:**
  - Each comment should be relevant to the product reviewed and offer insights into the product's features, benefits, or areas for improvement based on the description.
  - Ratings should reflect the reviewer's overall satisfaction with the product. Ratings should be integers that vary from 1 to 5.

**Format:**
- The content should be structured as a table with semicolon-separated values (CSV format). Make sure to start with "Name;Email Address;Product Reviewed;Product Rating;Comment" and then continue.
- Each entry must contain the following columns in order: Name, Email Address, Product Reviewed, Product Rating, Comment.
- Product ratings should be expressed as integers from 1 to 5, where 1 represents the least satisfaction and 5 represents the highest satisfaction.

**Example Entry:**
- Name: Hina Ahmed
- Email Address: hina.ahmed@fakemail.com
- Product Reviewed: SoundSerene Pro
- Product Rating: 5 
- Comment: "The SoundSerene Pro headset delivers exceptional sound quality and noise cancellation in a sleek, portable design, setting a new standard for audio excellence in headsets."

**Example of Final Output:**
"Name;Email Address;Product Reviewed;Product Rating;Comment
Hina Ahmed;hina.ahmed@fakemail.com;SoundSerene Pro;5;The SoundSerene Pro headset delivers exceptional sound quality and noise cancellation in a sleek, portable design, setting a new standard for audio excellence in headsets.
Khalil Al-Farouq;khalil.alfarouq@fakemail.com;Pixel XLt;1;The Pixel XLt struggles with a short battery life and a design that fails to impress, significantly underperforming in key areas expected from modern smartphones.
Min-Ji Kim;min-ji.kim@fakemail.com;SonicBoom X1;3;The SonicBoom X1 offers powerful sound and deep bass, but the high volume settings may distort the audio quality.
Pedro Martínez;pedro.martinez@fakemail.com;GamingGear M1;4;The GamingGear M1 has customizable buttons and a comfortable grip, but the scroll wheel could be smoother for easier navigation.
Radhika Gupta;radhika.gupta@fakemail.com;QuantumQuill R2;4;The QuantumQuill R2 features a durable build and quiet keys, but the key layout may take some time to get used to for touch typists."

"""
    
    return system_prompt
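
The `min(batch_size, len(products_data))` guard matters because `random.sample` raises a `ValueError` when asked for more items than the list holds. Here is a small sketch of that behavior, using a hypothetical three-item catalog:

```python
import random

catalog = ["Zephyr NC Headset", "Zephyr Gamer Mouse", "Zephyr VR Lite"]  # hypothetical mini catalog

# Asking for more items than exist raises ValueError
try:
    random.sample(catalog, 5)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# The guarded call caps the request at the catalog size
selected = random.sample(catalog, min(5, len(catalog)))

print(oversample_failed)  # True
print(len(selected))      # 3
```

With the guard in place, a `batch_size` larger than the catalog simply returns every product in random order.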

### After running the prompt multiple times, you'll notice that roughly 90% of the reviews are positive. Let's adjust the system prompt function so it produces positive or negative reviews *on demand*.

We'll do that by introducing a new parameter called `sentiment`.

In [12]:
def generate_system_prompt_sentiment(json_file_path, batch_size, sentiment):
    
    """
    Generates a dynamic system prompt for creating synthetic product reviews with specified sentiment.
    
    This function loads product data from a JSON file, selects a random subset of products based on the specified batch size,
    and constructs a prompt directing the creation of synthetic reviews. Each review includes details such as the product's
    brand, description, and a user comment reflecting the specified sentiment (positive, negative, or a balanced mix).
    
    Parameters:
    - json_file_path (str): Path to the JSON file containing product data.
    - batch_size (int): Number of products to include in the prompt.
    - sentiment (str): Desired sentiment for the reviews ('positive', 'negative', or other for a balanced mix).
    
    Returns:
    - str: A formatted string containing the system prompt with instructions for generating the reviews.
    """
    
    # Convert the sentiment input to lowercase for standardized comparison.
    sentiment_lower = sentiment.lower()

    # Assign rating instruction based on the sentiment.
    if sentiment_lower == "positive":
        rating_instruction = "Ensure each review includes a positive sentiment, with ratings of 4 or 5 stars. Comments should highlight satisfaction with the product’s features, user experience, and overall quality."
    elif sentiment_lower == "negative":
        rating_instruction = "Ensure each review includes a negative sentiment, with ratings of 1, 2, or 3 stars. Comments should critically address areas for improvement, dissatisfaction with the product’s features, or user experience issues."
    else:
        rating_instruction = "Alternate between positive reviews (4 or 5 stars, focusing on satisfaction and product strengths) and negative reviews (1 to 3 stars, highlighting areas for improvement and dissatisfaction), ensuring a balanced mix of sentiments."

    # Load the product data from the specified JSON file
    with open(json_file_path, 'r') as file:
        products_data = json.load(file)

    # Select a random set of products from the list, based on the specified batch size
    selected_products = random.sample(products_data, min(batch_size, len(products_data)))

    # Generate the dynamic system prompt
    system_prompt = f"""
Act as an expert Data Scientist who specializes in creating synthetic data.

**Objective: Create a table of fictional audience members reviewing the following computer hardware products taken from our product catalog. Each review should reflect the product's features and potential user experiences.

**Selected Products:
""" + "\n".join([f"*Brand: {product['brand']}, *Product: {product['product']}, *Description: {product['description']}" for product in selected_products]) + f"""

**Table Columns:
1. *Name: First and last name of the audience member. Ensure ethnic diversity in the names to reflect a global audience.
2. *Email Address: Should follow the format `[firstname].[lastname]@fakemail.com`.
3. *Product Reviewed: One of the selected products.
4. *Product Rating: Assign a rating from 1 to 5 stars.
5. *Comment: Provide a constructive comment about the product. Include aspects such as design, functionality, user experience, or how the product's features meet the user's needs.

** Specific instructions:
- *Audience Member Details:
  - Names should be diverse and creative, reflecting ethnic diversity to represent a global audience.
  - Email addresses must use the specified format and be unique for each audience member.
- *Product Review:
  - Each comment should be relevant to the product reviewed and offer insights into the product's features, benefits, or areas for improvement based on the description.
  - {rating_instruction}

**Format:
- The content should be structured as a table with semicolon-separated values (CSV format). Make sure to start with "Name;Email Address;Product Reviewed;Product Rating;Comment" and then continue.
- Each entry must contain the following columns in order: Name, Email Address, Product Reviewed, Product Rating, Comment.
- Product ratings should be expressed as integers from 1 to 5, where 1 represents the least satisfaction and 5 represents the highest satisfaction.

**Example Entry:
- Name: Hina Ahmed
- Email Address: hina.ahmed@fakemail.com
- Product Reviewed: SoundSerene Pro
- Product Rating: 5 
- Comment: "The SoundSerene Pro headset delivers exceptional sound quality and noise cancellation in a sleek, portable design, setting a new standard for audio excellence in headsets."

**Example of Final Output: Name;Email Address;Product Reviewed;Product Rating;Comment
Hina Ahmed;hina.ahmed@fakemail.com;SoundSerene Pro;5;The SoundSerene Pro headset delivers exceptional sound quality and noise cancellation in a sleek, portable design, setting a new standard for audio excellence in headsets.
Khalil Al-Farouq;khalil.alfarouq@fakemail.com;Pixel XLt;1;The Pixel XLt struggles with a short battery life and a design that fails to impress, significantly underperforming in key areas expected from modern smartphones.
Min-Ji Kim;min-ji.kim@fakemail.com;SonicBoom X1;3;The SonicBoom X1 offers powerful sound and deep bass, but the high volume settings may distort the audio quality.
Pedro Martínez;pedro.martinez@fakemail.com;GamingGear M1;4;The GamingGear M1 has customizable buttons and a comfortable grip, but the scroll wheel could be smoother for easier navigation.
Radhika Gupta;radhika.gupta@fakemail.com;QuantumQuill R2;4;The QuantumQuill R2 features a durable build and quiet keys, but the key layout may take some time to get used to for touch typists.

"""
    
    return system_prompt

### Generate System Prompt for Sentiment Analysis - Comments

The previous function, `generate_system_prompt_sentiment`, dynamically creates system prompts for generating synthetic product reviews with a specified sentiment. Here's how it works:

1. **Load Product Data**: It starts by reading product data from a JSON file, specified by `json_file_path`.
2. **Random Product Selection**: A subset of products is randomly selected based on `batch_size`, ensuring varied data for the reviews.
3. **Sentiment Handling**: Depending on the `sentiment` argument, it prepares instructions to generate reviews with positive, negative, or a balanced mix of sentiments.
4. **Dynamic Prompt Construction**: A detailed system prompt is constructed, outlining the task of creating synthetic reviews, including instructions for the review's tone and content.
   
#### Parameters:
- `json_file_path` (str): Path to the JSON file with product data.
- `batch_size` (int): Number of products to process.
- `sentiment` (str): Target sentiment for the reviews ('positive', 'negative', or other for a mix).

#### Returns:
- A string that contains the system prompt with detailed instructions for generating synthetic reviews.

The prompt asks for a table of fictional audience members. Each row holds the reviewer's name and email address, the product reviewed, a rating, and a comment. The prompt also calls for ethnically diverse names, a fixed email format, and comments that reflect realistic user experiences in line with the requested sentiment.

The function uses Python's f-strings and an `if`/`elif`/`else` branch to tailor the output, combining data loading with dynamic text generation.
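
The same sentiment branching can be reused after generation to check that the ratings in a batch match what was requested. This is a minimal sketch. The helpers `allowed_ratings` and `off_target_ratings` are hypothetical and simply mirror the rating ranges written in the prompt instructions (4 or 5 for positive, 1 to 3 for negative, anything from 1 to 5 otherwise).

```python
def allowed_ratings(sentiment):
    """Return the set of star ratings the prompt asks for, given a sentiment."""
    sentiment_lower = sentiment.lower()  # same normalization as the prompt function
    if sentiment_lower == "positive":
        return {4, 5}
    if sentiment_lower == "negative":
        return {1, 2, 3}
    return {1, 2, 3, 4, 5}  # balanced mix

def off_target_ratings(ratings, sentiment):
    """List the ratings that fall outside the requested sentiment's range."""
    valid = allowed_ratings(sentiment)
    return [rating for rating in ratings if rating not in valid]

print(off_target_ratings([1, 2, 4, 3], "negATIVe"))  # [4]
```

An empty result suggests the model followed the rating instruction for that batch.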


### Example of calling the generate_system_prompt_sentiment function

In [13]:
# Usage example with a specified batch size
json_file_path = "your_file_path_here.json"  # Path to your JSON file
batch_size = 5  # Specify the number of products you want to include in the prompt
sentiment = "negATIVe"  # Mixed case still works: the function lowercases the value
system_prompt = generate_system_prompt_sentiment(json_file_path, batch_size, sentiment)
print(system_prompt)


Act as an expert Data Scientist who specializes in creating synthetic data.

**Objective: Create a table of fictional audience members reviewing the following computer hardware products taken from our product catalog. Each review should reflect the product's features and potential user experiences.

**Selected Products:
*Brand: ZephyrTech, *Product: Zephyr NC Headset, *Description: A noise-cancelling headset for clear audio in any environment.
*Brand: ZephyrTech, *Product: Zephyr Gamer Mouse, *Description: A mouse designed for hardcore gamers with customizable buttons.
*Brand: ZephyrTech, *Product: Zephyr VR Lite, *Description: A lightweight and affordable VR headset for beginners.
*Brand: TerraTech, *Product: Terra Precision Mouse, *Description: A mouse designed for precision in design and gaming.
*Brand: NebulaTech, *Product: Nebula VR Headset Pro, *Description: A high-end VR headset for immersive experiences.

**Table Columns:
1. *Name: First and last name of the audience member. E

# Just for fun (and safety), make the user_prompt dynamic with the same batch_size specified in the system_prompt

In [14]:
def generate_user_prompt(batch_size):
    """Return a user prompt requesting `batch_size` reviews in semicolon-separated format."""

    # Generate the dynamic user prompt
    user_prompt = f"""Create a table with {batch_size} fictional user reviews based on your instructions. The content should be structured as a table with semicolon-separated values (CSV format). Make sure to start with "Name;Email Address;Product Reviewed;Product Rating;Comment" and then continue. Only output the column names and the data without additional text please."""

    return user_prompt

In [15]:
# Usage example
batch_size = 5  # This can be any number depending on your requirements
user_prompt = generate_user_prompt(batch_size)
print(user_prompt)

Create a table with 5 fictional user reviews based on your instructions. The content should be structured as a table with semicolon-separated values (CSV format). Make sure to start with "Name;Email Address;Product Reviewed;Product Rating;Comment" and then continue. Only output the column names and the data without additional text please.


# Now we can write a data-generation code that calls the dynamic prompts using a "for" loop

In [17]:
# Inputs
client = Client(Token)
dataset_size = 42
batch_size = 5
sentiment = "negative"

# Placeholder for the aggregated results
all_data = []

# Calculate the total number of batches, rounding up to include partial batches
total_batches = math.ceil(dataset_size / batch_size)

for batch_num in range(total_batches):
    # Adjust the batch size for the last batch, if necessary
    current_batch_size = batch_size if (batch_num + 1) * batch_size <= dataset_size else dataset_size % batch_size
    
    # Call the OctoAI API to generate a batch of synthetic data
    completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": generate_system_prompt_sentiment(json_file_path, current_batch_size, sentiment)},
            {"role": "user", "content": generate_user_prompt(current_batch_size)}
        ],
        model="mixtral-8x7b-instruct-fp16",  # Pick the model you want from OctoAI
        max_tokens=1500,       # Maximum number of tokens to generate in the completion
        presence_penalty=0,    # No penalty for tokens that have already appeared, so repetition is allowed
        temperature=0.1,       # Low randomness: outputs are close to deterministic
        top_p=0.9,             # Nucleus sampling: draw from the smallest token set covering 90% of probability
    )
    
    # Parsing the JSON output from the API
    output = json.dumps(completion.dict(), indent=2)  # Convert API response to JSON string
    data = json.loads(output)  # Parse the JSON string into a Python dictionary
    
    # Extract table content from the 'content' field
    table_content = data['choices'][0]['message']['content']
    
    # Convert the generated batch data to a DataFrame and append it to 'all_data'
    parsed_data = StringIO(table_content)
    batch_df = pd.read_csv(parsed_data, sep=';')
    
    # Check if the batch_df has more rows than expected and adjust if necessary
    if len(batch_df) > current_batch_size:
        batch_df = batch_df.iloc[:current_batch_size]
    
    # Append the batch DataFrame to the list of all data
    all_data.append(batch_df)

# Concatenate all batch DataFrames into a single DataFrame
final_df = pd.concat(all_data, ignore_index=True)

# Optional: Verify the final row count matches the expected dataset_size
if len(final_df) != dataset_size:
    print(f"Warning: Expected {dataset_size} rows, but got {len(final_df)} rows.")
    
# Save the final DataFrame to an Excel file
output_file_path = "your_file_path_here.xlsx"  # Update this path; pandas picks the Excel writer from the file extension
final_df.to_excel(output_file_path, index=True)

# Display an end-of-task message
print(f"DataFrame with {len(final_df)} rows saved to {output_file_path}")

DataFrame with 42 rows saved to your_file_path_here.xlsx
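
Models do not always return clean CSV. A reply may open with a sentence or end with a closing remark, which would confuse `pd.read_csv`. One possible pre-cleaning step is to keep only the header line and the rows that have the expected number of fields. The sketch below works under that assumption. `extract_csv_rows` and the sample reply are hypothetical, not part of the original notebook.

```python
import csv
from io import StringIO

HEADER = "Name;Email Address;Product Reviewed;Product Rating;Comment"

def extract_csv_rows(raw_reply, header=HEADER):
    """Keep every line after the header that has the same number of fields."""
    lines = raw_reply.splitlines()
    # Find where the table starts and ignore any chatter before the header
    start = next(
        (i for i, line in enumerate(lines) if line.strip().strip('"') == header),
        None,
    )
    if start is None:
        return []  # no recognizable table in the reply
    expected_fields = header.count(";") + 1
    reader = csv.reader(StringIO("\n".join(lines[start + 1:])), delimiter=";")
    return [row for row in reader if len(row) == expected_fields]

# Hypothetical reply with extra text around the table
reply = """Sure, here is the table:
Name;Email Address;Product Reviewed;Product Rating;Comment
Ivan Petrov;ivan.petrov@fakemail.com;Echo Dynamic Headset;1;The sound keeps cutting out.
Thanks for asking!"""

rows = extract_csv_rows(reply)
print(len(rows))  # 1
```

The cleaned rows could then be turned into a DataFrame with `pd.DataFrame(rows, columns=HEADER.split(";"))` in place of the direct `pd.read_csv` call.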


## Check your dataframe 

In [18]:
final_df.sample(9)

| | Name | Email Address | Product Reviewed | Product Rating | Comment |
|---|---|---|---|---|---|
| 26 | Ivan Petrov | ivan.petrov@fakemail.com | Echo Dynamic Headset | 1 | The Echo Dynamic Headset's dynamic sound adjus... |
| 39 | Sophia Johnson | sophia.johnson@fakemail.com | Nebula VR Headset Pro | 3 | The Nebula VR Headset Pro delivers immersive e... |
| 19 | Talia Lee | talia.lee@fakemail.com | Lumin Comfy Keyboard | 2 | The Lumin Comfy Keyboard is comfortable, but t... |
| 5 | Jamila Nguyen | jamila.nguyen@fakemail.com | Orion Speed Mouse | 2 | The Orion Speed Mouse is indeed fast, but its ... |
| 25 | Ella Nguyen | ella.nguyen@fakemail.com | Lumin Waterproof Headset | 2 | The Lumin Waterproof Headset has a solid build... |
| 22 | Claudia Sánchez | claudia.sanchez@fakemail.com | Terra VR Outdoor | 3 | The Terra VR Outdoor headset is sturdy and gre... |
| 8 | Kareem Olawale | kareem.olawale@fakemail.com | Astra Phone D2 | 2 | The Astra Phone D2 is durable and reliable, bu... |
| 15 | Jamila Nguyen | jamila.nguyen@fakemail.com | Echo VR Pro | 2 | The Echo VR Pro has impressive graphics, but t... |
| 16 | Luca Silva | luca.silva@fakemail.com | Quantum Vertical Mouse | 1 | The Quantum Vertical Mouse has a strange grip ... |
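
The sample hints at something worth checking: the same reviewer (Jamila Nguyen) appears twice, although the prompt asks for unique email addresses. Each batch is generated by an independent API call, so the model has no way of knowing which names earlier batches used. Below is a quick sketch for spotting repeats and malformed addresses. It works on plain dictionaries, such as the output of `final_df.to_dict("records")`. The helper name `audit_emails` and the last sample row are hypothetical.

```python
import re
from collections import Counter

# Expected shape: firstname.lastname@fakemail.com
EMAIL_PATTERN = re.compile(r"^[^@\s.]+\.[^@\s]+@fakemail\.com$")

def audit_emails(records):
    """Return (duplicated_emails, malformed_emails) for a list of review records."""
    emails = [record["Email Address"].strip() for record in records]
    counts = Counter(emails)
    duplicated = sorted(email for email, count in counts.items() if count > 1)
    malformed = sorted({email for email in emails if not EMAIL_PATTERN.match(email)})
    return duplicated, malformed

records = [
    {"Name": "Jamila Nguyen", "Email Address": "jamila.nguyen@fakemail.com"},
    {"Name": "Jamila Nguyen", "Email Address": "jamila.nguyen@fakemail.com"},
    {"Name": "Luca Silva", "Email Address": "luca.silva@fakemail.com"},
    {"Name": "Test User", "Email Address": "testuser@example.com"},  # hypothetical bad row
]

duplicated, malformed = audit_emails(records)
print(duplicated)  # ['jamila.nguyen@fakemail.com']
print(malformed)   # ['testuser@example.com']
```

Whether duplicates are a problem depends on the use case: one reviewer rating several products may be perfectly acceptable in a synthetic dataset.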


## Options for generating your final dataset

You can:
- Rerun the code with "positive" sentiment and then concatenate both of your dataframes.
  
**OR**

- Turn your code into a function for more flexibility.


# This is an example of what happens during the "Integrate" step.
We combine the previous scripts into a function that generates synthetic data with multiple variables including "sentiment."
The idea is to change the variables to be able to create a "negative" dataset and a "positive" one.
From there you can combine the outputs of your function into a single DataFrame.

In [21]:
def generate_synthetic_dataset(dataset_size, batch_size, sentiment, json_file_path, output_file_path, Token):
    """
    Generates a synthetic dataset using the OctoAI API.

    Args:
    dataset_size (int): The desired size of the dataset.
    batch_size (int): The number of entries to generate in each batch.
    sentiment (str): The sentiment of the dataset to be generated ('positive', 'negative', etc.).
    json_file_path (str): The path to the JSON file required for generating system prompts.
    output_file_path (str): The path where the generated dataset will be saved as an Excel file.
    Token (str): The OctoAI API token used to initialize the client.

    Returns:
    pandas.DataFrame: The final DataFrame containing the synthetic dataset.
    """
    
    # Placeholder for the aggregated results
    all_data = []

    # Calculate the total number of batches, rounding up to include partial batches
    total_batches = math.ceil(dataset_size / batch_size)
    client = Client(Token)  # Initialize the OctoAI client

    for batch_num in range(total_batches):
        # Adjust the batch size for the last batch, if necessary
        current_batch_size = batch_size if (batch_num + 1) * batch_size <= dataset_size else dataset_size % batch_size
        
        # Call the OctoAI API to generate a batch of synthetic data
        completion = client.chat.completions.create(
            messages=[
                {"role": "system", "content": generate_system_prompt_sentiment(json_file_path, current_batch_size, sentiment)},
                {"role": "user", "content": generate_user_prompt(current_batch_size)}
            ],
            model="mixtral-8x7b-instruct-fp16",  # Pick the model you want from OctoAI
            max_tokens=1500,  # Maximum number of tokens to generate in the completion
            presence_penalty=0,  # No penalty for tokens that have already appeared, so repetition is allowed
            temperature=0.1,  # Low randomness: outputs are close to deterministic
            top_p=0.9,  # Nucleus sampling: draw from the smallest token set covering 90% of probability
        )
    
        # Parsing the JSON output from the API
        output = json.dumps(completion.dict(), indent=2)  # Convert API response to JSON string
        data = json.loads(output)  # Parse the JSON string into a Python dictionary

        # Extract table content from the 'content' field
        table_content = data['choices'][0]['message']['content']

        # Convert the generated batch data to a DataFrame and append it to 'all_data'
        parsed_data = StringIO(table_content)
        batch_df = pd.read_csv(parsed_data, sep=';')

        # Check if the batch_df has more rows than expected and adjust if necessary
        if len(batch_df) > current_batch_size:
            batch_df = batch_df.iloc[:current_batch_size]

        # Append the batch DataFrame to the list of all data
        all_data.append(batch_df)

    # Concatenate all batch DataFrames into a single DataFrame
    final_df = pd.concat(all_data, ignore_index=True)

    # Optional: Verify the final row count matches the expected dataset_size
    if len(final_df) != dataset_size:
        print(f"Warning: Expected {dataset_size} rows, but got {len(final_df)} rows.")

    # Save the final DataFrame to an Excel file
    final_df.to_excel(output_file_path, index=False)

    # Display an end-of-task message
    print(f"DataFrame with {dataset_size} rows saved to output_file_path")

    return final_df
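
A long run (more than 200 API calls for 1,042 rows at 5 per batch) may occasionally hit a failed request or an unparsable reply, and a single exception would discard everything collected so far. One possible mitigation, not part of the original notebook, is to wrap the per-batch work in a small retry helper. `with_retries` and its parameters are hypothetical names.

```python
import time

def with_retries(task, max_attempts=3, base_delay=1.0):
    """Call task() until it succeeds, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:  # broad on purpose: API and parsing errors alike
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # doubles after every failure
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo with a hypothetical task that fails twice before succeeding
calls = {"count": 0}

def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("malformed reply")
    return "ok"

result = with_retries(flaky_task, base_delay=0.01)
print(result, calls["count"])  # ok 3
```

Inside the loop, the API call and the `pd.read_csv` step could be moved into one function and passed to `with_retries`, so a malformed batch is regenerated instead of stopping the run.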

### The `generate_synthetic_dataset` function from the "Integrate" step

This function creates a synthetic dataset using the OctoAI API. Here's how it works:

- **What it Does**: It generates a set of data entries based on the given sentiment, like positive or negative.
- **How it Works**:
  - The function takes in the size of the dataset you want, how many entries to create at a time (batch size), and the sentiment for the data.
  - It needs a JSON file to start generating system prompts and will save the final data as an Excel file.
- **Process**:
  1. The function calculates how many batches it needs to create the entire dataset.
  2. For each batch, it uses the OctoAI API to generate data entries.
  3. This data is then turned into a DataFrame.
  4. All these small tables (DataFrames) from each batch are combined into one big table.
- **Output**:
  - The function saves this big table as an Excel file at the given location.
  - It returns this table as a DataFrame, so you can use it right away in your code.

- **Parameters Explained**:
  - `dataset_size`: Total number of entries you want in the dataset.
  - `batch_size`: How many entries to generate at one time.
  - `sentiment`: Type of sentiment for the dataset (like positive, negative, etc.).
  - `json_file_path`: Where to find the JSON file for generating prompts.
  - `output_file_path`: Where to save the generated Excel file.
  - `Token`: The OctoAI API token used to create the client.

This function is designed to automatically create large amounts of text data with specific characteristics.
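
The batching arithmetic in step 1 can be checked in isolation. The sketch below lists the size of every batch for a given `dataset_size` and `batch_size`. The helper `plan_batches` is hypothetical, and it uses `min` in place of the modulo expression, which gives the same result.

```python
import math

def plan_batches(dataset_size, batch_size):
    """Return the number of rows to request in each batch."""
    total_batches = math.ceil(dataset_size / batch_size)
    return [
        min(batch_size, dataset_size - batch_num * batch_size)  # last batch may be smaller
        for batch_num in range(total_batches)
    ]

sizes = plan_batches(42, 5)
print(sizes)       # [5, 5, 5, 5, 5, 5, 5, 5, 2]
print(sum(sizes))  # 42
```

For the 1,042-row run, the same helper gives 209 batches: 208 full batches of 5 and a final batch of 2.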


# Now it's time to run the code.

In [22]:
# First, fill in the inputs
Token = Token  # Placeholder: Token must already hold your OctoAI API token
dataset_size = 1042
batch_size = 5
sentiment_df_negative = "negative"
output_file_path_negative = "your_file_path_here/negative_reviews_dataset.xlsx"
sentiment_df_positive = "positive"
output_file_path_positive = "your_file_path_here/positive_reviews_dataset.xlsx"
json_file_path = "your_file_path_here/products.json"

In [25]:
# No client is needed here: generate_synthetic_dataset creates its own from Token
df_positive = generate_synthetic_dataset(dataset_size, batch_size, sentiment_df_positive, json_file_path, output_file_path_positive, Token)
df_negative = generate_synthetic_dataset(dataset_size, batch_size, sentiment_df_negative, json_file_path, output_file_path_negative, Token)

DataFrame with 1042 rows saved to output_file_path
DataFrame with 1042 rows saved to output_file_path


In [26]:
combined_df_final = pd.concat([df_positive, df_negative], ignore_index=True)
combined_df_final.sample(10)
# and voilà

| | Name | Email Address | Product Reviewed | Product Rating | Comment |
|---|---|---|---|---|---|
| 870 | Xiao-Li Wang | xiao-li.wang@fakemail.com | Vortex VR Premium | 5 | The Vortex VR Premium offers an immersive expe... |
| 1448 | Karla Gómez | karla.gomez@fakemail.com | Zephyr Gamer Mouse | 1 | The Zephyr Gamer Mouse has customizable button... |
| 1036 | Sophia Johnson | sophia.johnson@fakemail.com | Vortex Ergo Keyboard | 5 | The Vortex Ergo Keyboard provides exceptional ... |
| 890 | Xiao-Li Wang | xiao-li.wang@fakemail.com | Echo Dynamic Headset | 5 | The Echo Dynamic Headset offers outstanding so... |
| 1856 | Karl Müller | karl.muller@fakemail.com | Quantum Backlit Keyboard | 1 | The Quantum Backlit Keyboard has customizable ... |
| 1161 | Mia Kim | mia.kim@fakemail.com | Echo Clicky Mouse | 2 | The Echo Clicky Mouse has satisfying click fee... |
| 424 | Amina Olusegun | amina.olusegun@fakemail.com | Zephyr Gamer Mouse | 5 | The Zephyr Gamer Mouse has customizable button... |
| 78 | Mei-Ying Chen | mei-ying.chen@fakemail.com | Lumin Flashy Mouse | 5 | The Lumin Flashy Mouse is a stylish and custom... |
| 311 | Li Wei | li.wei@fakemail.com | Skyline Dev Keyboard | 5 | The Skyline Dev Keyboard is a game-changer for... |
| 222 | Laila El-Rashid | laila.el-rashid@fakemail.com | Astra Slim Keyboard | 5 | The Astra Slim Keyboard is a game-changer for ... |


In [30]:
final_output_file_path = "your_file_path_here.xlsx"
combined_df_final.to_excel(final_output_file_path, index=True)
# Display an end-of-task message
print(f"DataFrame saved to {final_output_file_path}")

DataFrame saved to your_file_path_here.xlsx


# Here you can "Restart" another loop to conduct an exploratory data analysis.

### Example #1: Analyzing Product Review Frequency

**Task**: Conduct an analysis for Product Review Frequency.

**Steps**:
1. **Count the Number of Reviews for Each Product**:
   - Purpose: To determine how frequently each product is reviewed, which can indicate its popularity or consumer interest.
<br><br>
2. **Visualize the Data**:
   - Action: Create a plot to display the frequency of reviews for each product.
   - Benefit: This makes it easier to visually compare the review frequency of the products.


In [None]:
# 1. Count the number of reviews for each product
product_review_count = combined_df_final['Product Reviewed'].value_counts()

# 2. Visualize the data
plt.figure(figsize=(12, 15))
sns.barplot(x=product_review_count.values, y=product_review_count.index)
plt.title("Product Review Frequency")
plt.xlabel("Number of Reviews")
plt.ylabel("Product Reviewed")
plt.tight_layout()  # Adjust layout to make full use of space
plt.show()

### Example #2: Text Analysis on the "Comment" Column

**Task**: Perform text analysis on the "Comment" column to extract common themes or features mentioned in the reviews. Use a word cloud for this, as it's an effective visual tool for highlighting the most frequent words.

**Steps**:
1. **Combine All Comments**:
   - Action: Concatenate all the text from the "Comment" column into a single large text string.
   - Purpose: To prepare the data for analysis as a unified text corpus.
<br><br>
2. **Text Preprocessing**:
   - Action: Remove common stopwords (like "the", "a", "in") that don’t add significant meaning to the text analysis.
   - Purpose: To clean the text, ensuring the word cloud reflects meaningful words only.
<br><br>
3. **Generate the Word Cloud**:
   - Action: Create a visual representation of the most common words in the comments.
   - Purpose: To easily identify and visualize the most frequently mentioned words or themes in the reviews.


In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# 1. Combine all comments into a single text string
all_comments = ' '.join(combined_df_final['Comment'].dropna())

# 2. Text preprocessing - define stopwords to exclude from the analysis
stopwords = set(STOPWORDS)

# 3. Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords, max_words=100).generate(all_comments)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Hide the axes
plt.show()
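
A word cloud shows relative frequency but not the numbers behind it. As a complement, the same three steps can be sketched with the standard library: tokenize, drop stopwords, and count. The tiny stopword set and the sample comments below are placeholders. In the notebook you would pass `combined_df_final['Comment'].dropna()` and could reuse the `STOPWORDS` set from `wordcloud`.

```python
import re
from collections import Counter

def top_words(comments, stopwords, n=3):
    """Return the n most frequent non-stopword tokens across all comments."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z']+", comment.lower())  # lowercase words only
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)

# Placeholder comments and stopwords for illustration
sample_comments = [
    "The mouse is fast, but the grip is strange.",
    "The keyboard is comfortable, but the grip feels cheap.",
    "Fast mouse with a solid grip.",
]
basic_stopwords = {"the", "is", "but", "a", "with"}

top = top_words(sample_comments, basic_stopwords)
print(top)
```

Product names will likely dominate both the counts and the cloud, so adding them to the stopword set may surface more informative words.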