## Synthetic Data Generation for Amazon Reviews

The Faker library is used in this code to generate synthetic text for fake reviews. Faker can produce many kinds of fake data, such as names, addresses, paragraphs, and words. Here it generates the paragraph that forms the body of each review; these paragraphs are filler text built from random words rather than coherent sentences. A few words drawn from a list of product-related keywords (`product_keywords`) are then appended to each paragraph, so that the result resembles an Amazon product review and can be matched against the aspects defined in the `aspects_keywords` dictionary.

The overall goal of the code is to generate synthetic data for training or testing an NLP (Natural Language Processing) model that classifies Amazon product reviews by aspect. It defines aspects such as 'Aesthetics', 'Durability', and 'Ease of Use', each associated with a small set of keywords. The `generate_synthetic_data` function uses Faker to create varied review texts and regenerates a review until it contains at least one aspect keyword. The label is the first aspect in the dictionary whose keywords appear in the text, so every row in the output has exactly one aspect. The resulting data is stored in a Pandas DataFrame with `Review` and `Aspect` columns and can be used to train or evaluate an aspect classifier.
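
The keyword-based labeling described above can be sketched as follows. The first function mirrors the notebook's substring check; the second, `assign_aspect_whole_word`, is a hypothetical variant that is not part of the original code. It is included only to show how substring matching can fire on unrelated words (for example, 'use' inside 'because'). Only two of the twelve aspects are copied here to keep the example short.

```python
import re

# Two aspects copied from the notebook's aspects_keywords dictionary
aspects_keywords = {
    'Ergonomics': ['comfortable', 'easy', 'awkward'],
    'Ease of Use': ['use', 'easy', 'convenient'],
}

def assign_aspect_substring(review_text):
    """Notebook-style matching: a keyword counts if it appears anywhere in the text."""
    text = review_text.lower()
    for aspect, keywords in aspects_keywords.items():
        if any(keyword in text for keyword in keywords):
            return aspect
    return None

def assign_aspect_whole_word(review_text):
    """Hypothetical variant: a keyword counts only when it is a standalone word."""
    text = review_text.lower()
    for aspect, keywords in aspects_keywords.items():
        pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
        if re.search(pattern, text):
            return aspect
    return None

sentence = 'I bought it because of the colour.'
print(assign_aspect_substring(sentence))    # matches 'use' inside 'because'
print(assign_aspect_whole_word(sentence))   # no standalone keyword present
print(assign_aspect_substring('Easy and convenient.'))  # first matching aspect wins
```

Because aspects are checked in dictionary order, a keyword shared by two aspects (such as 'easy') always resolves to the aspect defined first.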




In [4]:
#!pip install faker #package version faker-20.0.3

##### Importing libraries

In [5]:
import pandas as pd
from faker import Faker
import random

In [None]:
# Initialize Faker
fake = Faker()

# Define aspects and corresponding keywords
aspects_keywords = {
    'Aesthetics': ['crisp', 'beautiful', 'wrinkled'],
    'Ease of Reprocessing': ['wash', 'clean', 'charge'],
    'Durability': ['wear', 'died', 'resistant'],
    'Use Efficiency': ['time', 'fast', 'long'],
    'Performance': ['hold', 'well', 'glitch'],
    'Adaptability': ['versatile', 'outside', 'suitable'],
    'Ergonomics': ['comfortable', 'easy', 'awkward'],
    'Ease of Storage': ['store', 'fold', 'small'],
    'Ease of Use': ['use', 'easy', 'convenient'],
    'Interference': ['loud', 'taste', 'smell'],
    'Safety': ['safe', 'drop', 'burn'],
    'Price': ['expensive', 'cheap', 'cost']
}

# Updated product_keywords list for Amazon product reviews
product_keywords = [
    'Excellent', 'High-quality', 'Durable', 'Efficient', 'Impressive', 'Versatile',
    'Comfortable', 'User-friendly', 'Compact', 'Sleek', 'Convenient', 'Innovative',
    'Sturdy', 'Reliable', 'Outstanding', 'Attractive', 'Easy-to-use', 'Affordable',
    'Premium', 'Satisfying', 'Elegant', 'Superior', 'Exceptional', 'Satisfactory',
    'Pleasurable', 'Top-notch', 'Fantastic', 'Value-for-money',
    'Well-designed', 'Smooth', 'Functional', 'Amazing', 'Great', 'Good', 'Quality',
    'Perfect', 'Satisfied', 'Recommended', 'Happy', 'Pleased', 'Worth', 'Awesome',
    'loud', 'taste', 'smell'
]

# Keep track of generated reviews to avoid duplicates
generated_reviews = set()

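# Function to assign an aspect label based on keywords
def assign_aspect_label(review_text):
    """Return the first aspect whose keywords occur in the review, or None."""
    # Lowercase once; matching is by substring, so 'use' also matches 'because'
    text = review_text.lower()
    for aspect, keywords in aspects_keywords.items():
        if any(keyword in text for keyword in keywords):
            return aspect
    return None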

# Function to generate synthetic data
def generate_synthetic_data(num_rows):
    data = {'Review': [], 'Aspect': []}

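    for _ in range(num_rows):

        # Generate a unique review text that contains at least one aspect keyword
        while True:
            review_text = fake.paragraph(nb_sentences=3)
            # Separate the paragraph from the appended keywords with a space
            review_text += ' ' + ' '.join(fake.words(nb=random.randint(2, 5), ext_word_list=product_keywords))

            # Skip duplicates so that every review in the dataset is unique
            if review_text in generated_reviews:
                continue

            aspect_label = assign_aspect_label(review_text)
            if aspect_label is not None:
                generated_reviews.add(review_text)
                break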

        # Append to the data dictionary
        data['Review'].append(review_text)
        data['Aspect'].append(aspect_label)

    # Create a DataFrame from the data dictionary
    df = pd.DataFrame(data)
    return df

# Generate 90,000 rows of synthetic data
synthetic_data = generate_synthetic_data(num_rows=90000)



In [None]:
# Save the synthetic data to a CSV file
synthetic_data.to_csv('Amazon_Synthetic_Training_Data.csv', index=False)

# Display the first few rows (must be the last expression in the cell to render)
synthetic_data.head()