# Generating Synthetic Datasets with the Faker Library

In this section, we use the Python Faker library to generate synthetic data. It walks through several examples of how you can use Faker for various tasks. The main goal is to develop a privacy-centric approach to testing systems. In the last part, we generate fake data to complement the original data using Faker's localized providers.

### Initiate a fake generator using `Faker()`

We will initiate a fake generator using `Faker()`. By default, it uses the "en_US" locale.

In [1]:
from faker import Faker
fake = Faker()

##### Example 1

The `fake` object generates data through methods named after the kind of data you want. For example, `fake.name()` generates a random person's full name.

In [2]:
print(fake.name())

Kevin Brown


Similarly, we can generate a fake email address, country name, text, geolocation, and URL, as shown below.

In [3]:
print(fake.email())
print(fake.country())
print(fake.name())
print(fake.text())
print(fake.latitude(), fake.longitude())
print(fake.url())

garciaarthur@example.com
Saudi Arabia
Kristina Hernandez
Age meeting brother discussion authority some your.
Who color other ball attorney. Never deal drive part. Statement evening real see.
-29.7156545 115.159453
http://www.riley.biz/


##### Example 2

You can use different locales to generate data in diverse languages and for distinct regions.

In the example below, we will generate data in Spanish, localized for Spain.

In [4]:
fake = Faker("es_ES")
print(fake.email())
print(fake.country())
print(fake.name())
print(fake.text())
print(fake.latitude(), fake.longitude())
print(fake.url())

jordana66@example.com
Tayikistán
Gema Ramis Andreu
Neque modi placeat quo ex soluta voluptates. Asperiores dolore veritatis. Numquam doloribus quas.
-33.4545455 176.746478
https://inmobiliaria.es/


Let's try again with the German language and Germany as the country. To generate a full profile, we will use the `profile()` function.

In [5]:
fake = Faker("de_DE")
fake.profile()

{'job': 'Florist',
 'company': 'Kraushaar GbR',
 'ssn': '413-76-0237',
 'residence': 'Sven-Aumann-Platz 732\n42404 Karlsruhe',
 'current_location': (Decimal('42.676512'), Decimal('38.238571')),
 'blood_group': 'B+',
 'website': ['http://www.gorlitz.com/', 'https://bauer.com/'],
 'username': 'miriambiggen',
 'name': 'Alena Bähr',
 'sex': 'F',
 'address': 'Ernststraße 5\n42931 Cuxhaven',
 'mail': 'gschulz@gmx.de',
 'birthdate': datetime.date(1970, 9, 1)}

##### Example 3

In this example, we will create a pandas DataFrame using Faker:

1. Create an empty pandas DataFrame (`data`)

2. Loop x times, adding one row per iteration

3. Use `randint()` to generate a random id (values can repeat, so uniqueness is not guaranteed)

4. Use Faker to create a name, address, and geolocation

5. Run the `input_data()` function with x=10

In [6]:
from random import randint, choice
import pandas as pd

fake = Faker()

def input_data(x):
    
    # pandas dataframe
    data = pd.DataFrame()
    for i in range(0, x):
        data.loc[i, 'id']= randint(1, 100)
        data.loc[i, 'name']= fake.name()
        data.loc[i, 'address']= fake.address()
        data.loc[i, 'latitude']= str(fake.latitude())
        data.loc[i, 'longitude']= str(fake.longitude())
    return data

input_data(10)

Unnamed: 0,id,name,address,latitude,longitude
0,68.0,Jason Contreras,"25450 Serrano Ways Apt. 151\nAlexandraport, NV...",65.442878,136.837912
1,3.0,Roger Rocha,"71037 Lopez Station Apt. 827\nKirbyview, OK 83272",-78.5629995,-99.67974
2,81.0,Jesse Peters,"PSC 0419, Box 5660\nAPO AE 51864",64.935615,-119.589192
3,57.0,Kelly Osborn,"4278 Mcmillan Islands Suite 479\nDavidside, WA...",47.944649,-48.712804
4,76.0,Bradley Zhang,"PSC 2485, Box 8377\nAPO AA 11109",-67.994826,-156.823318
5,91.0,Joseph Lopez,"272 Amanda Squares\nAllenland, FM 95914",79.6503505,128.923951
6,99.0,Anna Martinez,"4723 Jose Streets\nLawsonbury, LA 19380",26.5768215,-0.121462
7,87.0,Tony Burke,"PSC 9778, Box 3172\nAPO AE 50183",-82.908033,153.009792
8,6.0,George Kane,"1594 Blackwell Loaf Suite 747\nPort Sara, VA 6...",-21.995434,-138.977059
9,58.0,Keith Barnett,"0316 Santos Burgs Apt. 388\nNorth Taramouth, O...",-89.1795615,131.388611


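To reproduce the result, we have to set the seed. Whenever we run the code cell again, Faker generates the same values. Note that `Faker.seed()` seeds only Faker's own generator, so the `id` column, which comes from `random.randint()`, also needs `random.seed()` to repeat exactly.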

In [7]:
Faker.seed(2)
input_data(10)

Unnamed: 0,id,name,address,latitude,longitude
0,43.0,Theresa Brown,"449 Catherine Prairie\nSouth Danielle, AS 95267",46.651752,19.748349
1,14.0,Jill Adams,"7566 Ann Freeway\nNorth Gregory, HI 18418",46.946829,93.924342
2,22.0,Jose Castro,Unit 7685 Box 9557\nDPO AE 99329,17.337539,67.715834
3,79.0,Danielle Ramirez,"775 Tucker Forges Suite 294\nCollinshaven, OH ...",46.1925255,169.725443
4,27.0,Jennifer Simon,"915 Rebecca Field Apt. 090\nBrittanyville, WV ...",-53.358589,-37.277591
5,55.0,Daniel Gilmore,"55230 Darius Cliff\nNorth Jessicatown, MN 34014",-55.697404,-95.647764
6,15.0,Kelly Smith,"00591 Rogers Burgs\nShaneton, MH 35096",17.871857,153.716342
7,2.0,Kaitlyn Perez,"07289 Tucker Islands Apt. 544\nValdezside, DC ...",-74.757968,-44.189418
8,10.0,Christopher Davis,"37141 Claudia Union\nJonathanchester, KY 69469",-88.704857,-98.87051
9,22.0,Jason Rodriguez,"10231 Andrew Skyway\nChavezside, SD 50308",-32.9709565,-67.18167


## Generating Data for the Fufu Republic Dimensional Model

##### For the Customer Table

In [9]:
def customer_data(x):
    customer = pd.DataFrame()
    
    # Set a starting phone number (e.g., +234 for Nigeria's country code)
    base_phone_number = "+2348000000000"
    
    for i in range(0, x):
        customer.loc[i, 'Customer_id'] = randint(1, 100)  # Random ID in 1-100; repeats are possible across rows
        customer.loc[i, 'Name'] = fake.name()
        # Ensure the phone numbers are unique by incrementing the base phone number
        customer.loc[i, 'Phone_number'] = f"{base_phone_number[:-len(str(i))]}{i}" 
        customer.loc[i, 'Email'] = fake.email()
        
    # Cast Customer_id to integer
    customer['Customer_id'] = customer['Customer_id'].astype(int)
    
    return customer

customer_data = customer_data(50)
customer_data.to_csv('customer.csv', index=False) # Saving data to csv file

customer_data.head()

Unnamed: 0,Customer_id,Name,Phone_number,Email
0,28,Jessica Hunt,2348000000000,keithsarah@example.net
1,78,Brandon Lang,2348000000001,andrew19@example.com
2,100,William Watson,2348000000002,sawyerrobert@example.org
3,35,Ryan Chambers,2348000000003,burnettbrian@example.net
4,29,Mr. Kevin Jones,2348000000004,andrew41@example.org
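
Since `randint(1, 100)` can return the same value more than once, `Customer_id` is not a reliable primary key as generated above. Below is a minimal sketch of drawing ids without replacement and building the phone numbers by addition instead of string slicing. It assumes the same 1-100 id range and the same `+234` base number; the function name and its parameters are hypothetical.

```python
from random import sample

def unique_customer_keys(n, id_pool=range(1, 101), phone_start=8000000000):
    """Return n (customer_id, phone_number) pairs with no repeated ids or numbers."""
    # sample() draws without replacement, so each id appears at most once
    ids = sample(id_pool, n)
    # Adding the row index to a numeric base keeps the length fixed and every number distinct
    phones = [f"+234{phone_start + i}" for i in range(n)]
    return list(zip(ids, phones))

pairs = unique_customer_keys(50)
print(pairs[:3])
```

`sample()` raises a `ValueError` if n is larger than the pool, which is a useful early warning that the id range is too small.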


##### For the Branch Table

In [10]:
fake = Faker()

def Branch(num_branches):
    # pandas dataframe
    branch = pd.DataFrame()
    
    branch_suffix = ['Downtown', 'Mall', 'Plaza', 'Corner', 'Central', 'Westside', 'Eastside']
    
    for i in range(num_branches):
        branch.loc[i, 'Branch_id'] = i + 1  # Unique Branch ID for each branch
        branch.loc[i, 'Branch_name'] = f"Fufu Republic {branch_suffix[i % len(branch_suffix)]}"  # Generate unique branch name
        branch.loc[i, 'City'] = fake.city()  # Generate random city
        branch.loc[i, 'Manager'] = fake.name()  # Generate random manager name
        
    # Cast Branch_id to integer
    branch['Branch_id'] = branch['Branch_id'].astype(int)
    
    return branch

# Generate 7 unique branches
branch_data = Branch(7)
branch_data.to_csv('branch.csv', index=False) # Saving data to csv file

# Display branch data for verification
branch_data.head()

Unnamed: 0,Branch_id,Branch_name,City,Manager
0,1,Fufu Republic Downtown,Teresamouth,Steven Allen
1,2,Fufu Republic Mall,Taylortown,Jonathan Brown
2,3,Fufu Republic Plaza,New Samantha,Kelly Maldonado
3,4,Fufu Republic Corner,East Nathan,Joshua Wright
4,5,Fufu Republic Central,Lewiston,Lori Rogers


##### For the Item (Menu) Table

In [11]:
fake = Faker()

def item_data(num_items):
    # Predefined list of food items sold by Fufu Republic
    food_items = ['Fufu', 'Jollof Rice', 'Egusi Soup', 'Pounded Yam', 'Suya', 'Fried Plantain', 
                  'Pepper Soup', 'Moi Moi', 'Goat Meat', 'Chicken Wings', 'Okra Soup', 'Zobo Drink']
    
    categories = ['Main Course', 'Appetizer', 'Dessert', 'Beverage']
    item = pd.DataFrame()
    
    for i in range(num_items):
        item.loc[i, 'Item_id'] = i + 1  # Assign unique IDs starting from 1
        item.loc[i, 'Name'] = choice(food_items)  # Choose random food name from predefined list
        item.loc[i, 'Category'] = choice(categories)  # Choose random category
        item.loc[i, 'Price'] = round(fake.random_number(digits=2), 2)  # Generate random price between 0 and 99
        
    # Cast Item_id to integer
    item['Item_id'] = item['Item_id'].astype(int)
    
    return item

# Generate data for 50 items
item_data = item_data(50)
item_data.to_csv('item.csv', index=False)

# Display the generated data for verification
item_data.head()

Unnamed: 0,Item_id,Name,Category,Price
0,1,Chicken Wings,Dessert,18.0
1,2,Fried Plantain,Main Course,54.0
2,3,Moi Moi,Main Course,16.0
3,4,Chicken Wings,Appetizer,50.0
4,5,Zobo Drink,Dessert,40.0
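
Because the name and the category are chosen independently, the output above repeats items and pairs them with unlikely categories (for example, Chicken Wings as a Dessert). Below is a minimal sketch that fixes the category per item and emits one row per item. The category assignments and the price range are assumptions made for illustration only.

```python
from random import uniform

# Hypothetical category assignments for a few of the menu items listed above
MENU = {
    "Fufu": "Main Course",
    "Jollof Rice": "Main Course",
    "Suya": "Appetizer",
    "Fried Plantain": "Appetizer",
    "Zobo Drink": "Beverage",
}

def menu_rows(price_low=5.0, price_high=99.0):
    """Return one row per menu item, with a fixed category and a two-decimal price."""
    return [
        {
            "Item_id": item_id,
            "Name": name,
            "Category": category,
            "Price": round(uniform(price_low, price_high), 2),
        }
        for item_id, (name, category) in enumerate(MENU.items(), start=1)
    ]

rows = menu_rows()
print(rows[0])
```

The list can be passed straight to `pd.DataFrame(rows)` to get the same shape as the item table.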


##### For the Order Header Table

In [12]:
fake = Faker()

# Use these from the data you generated earlier
branch_ids = branch_data['Branch_id'].unique().tolist()  # List of unique branch IDs
item_ids = item_data['Item_id'].unique().tolist()        # List of unique item IDs
customer_ids = customer_data['Customer_id'].unique().tolist()  # Customer IDs present in the customer table
payment_method_ids = list(range(1, 6))  # 5 payment methods (IDs 1-5, matching the Payment Method table)
promotion_ids = list(range(1, 11))  # Assumes 10 promotions; the Promotion table below defines only IDs 1-5
inventory_ids = list(range(1, 51))  # Assumes inventory IDs 1-50; the Inventory table below draws IDs from 1-1000

def order_header_data(x):
    order_header = pd.DataFrame()

    for i in range(0, x):
        order_header.loc[i, 'Order_id'] = randint(1, 1000)  # Random ID; not guaranteed unique
        order_header.loc[i, 'Item_id'] = choice(item_ids)  # Use valid item IDs
        order_header.loc[i, 'Branch_id'] = choice(branch_ids)  # Use valid branch IDs
        order_header.loc[i, 'Customer_id'] = choice(customer_ids)  # Use valid customer IDs
        order_header.loc[i, 'Payment_method_id'] = choice(payment_method_ids)  # Use valid payment method IDs
        order_header.loc[i, 'Promotion_id'] = choice(promotion_ids)  # Use valid promotion IDs
        order_header.loc[i, 'Inventory_id'] = choice(inventory_ids)  # Use valid inventory IDs
        order_header.loc[i, 'Discount_amount'] = round(fake.pyfloat(left_digits=2, right_digits=2, positive=True), 2)
        order_header.loc[i, 'Dining_option'] = choice(['Dine-in', 'Take-out', 'Online'])
        order_header.loc[i, 'Order_time'] = fake.time()
        order_header.loc[i, 'Order_date'] = fake.date_this_year()
        
    # Cast specific columns to integers
    int_columns = ['Order_id', 'Item_id', 'Branch_id', 'Customer_id', 
                   'Payment_method_id', 'Promotion_id', 'Inventory_id']
    order_header[int_columns] = order_header[int_columns].astype(int)

    return order_header

# Generate 50 orders
order_header_data = order_header_data(50)
order_header_data.to_csv('order_header.csv', index=False)

# Preview the first few rows
order_header_data.head()

Unnamed: 0,Order_id,Item_id,Branch_id,Customer_id,Payment_method_id,Promotion_id,Inventory_id,Discount_amount,Dining_option,Order_time,Order_date
0,459,43,5,87,1,7,44,40.4,Dine-in,00:41:55,2024-03-17
1,372,16,4,28,1,9,6,13.17,Take-out,20:25:30,2024-07-30
2,154,6,2,5,2,7,48,23.97,Dine-in,16:18:10,2024-09-23
3,308,20,1,73,5,6,7,59.8,Take-out,14:25:24,2024-03-22
4,874,28,6,42,3,3,13,20.41,Online,20:16:38,2024-04-17
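
The foreign keys in this table are only useful if they point at rows that exist in the parent tables. Below is a minimal sketch of a check for that, using small stand-in tables in place of the generated ones; the helper name `orphan_keys` is hypothetical.

```python
import pandas as pd

def orphan_keys(child, parent, key):
    """Return the sorted key values in `child` that have no match in `parent`."""
    missing = ~child[key].isin(parent[key])
    return sorted(child.loc[missing, key].unique().tolist())

# Tiny stand-in tables; in the notebook these would be the generated DataFrames
customers = pd.DataFrame({"Customer_id": [28, 78, 100]})
orders = pd.DataFrame({"Order_id": [1, 2, 3], "Customer_id": [28, 87, 100]})

print(orphan_keys(orders, customers, "Customer_id"))  # [87]
```

Running the same check for each foreign key column before saving the CSV files catches ids that were drawn from an assumed range rather than from the parent table.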


##### For the Order Item Table

In [13]:
# Use these from the data you generated earlier
item_ids = item_data['Item_id'].unique().tolist()  # List of unique item IDs
order_ids = order_header_data['Order_id'].unique().tolist()  # List of unique order IDs

def order_items_data(x):
    order_items = pd.DataFrame()

    for i in range(0, x):
        order_items.loc[i, 'Order_item_id'] = randint(1, 1000)
        order_items.loc[i, 'Order_id'] = choice(order_ids)  # Use valid order IDs
        order_items.loc[i, 'Item_id'] = choice(item_ids)  # Use valid item IDs
        order_items.loc[i, 'Quantity'] = randint(1, 5)
        
    # Cast specific columns to integers
    int_columns = ['Order_item_id', 'Order_id', 'Item_id', 'Quantity']
    order_items[int_columns] = order_items[int_columns].astype(int)
    
    return order_items

# Generate 50 order items
order_items_data = order_items_data(50)
order_items_data.to_csv('order_items.csv', index=False)

# Preview the first few rows
order_items_data.head()

Unnamed: 0,Order_item_id,Order_id,Item_id,Quantity
0,795,590,44,4
1,642,38,7,2
2,989,319,3,1
3,842,460,34,3
4,694,672,10,5


##### For the Inventory Table

In [14]:
# Use the existing data from item_data and branch_data tables
item_ids = item_data['Item_id'].unique().tolist()  # List of unique item IDs
branch_ids = branch_data['Branch_id'].unique().tolist()  # List of unique branch IDs

def inventory_data(x):
    inventory = pd.DataFrame()

    for i in range(0, x):
        inventory.loc[i, 'Inventory_id'] = randint(1, 1000)  # Wide range lowers the chance of duplicate IDs but does not rule them out
        inventory.loc[i, 'Item_id'] = choice(item_ids)  # Use valid Item IDs from item_data
        inventory.loc[i, 'Branch_id'] = choice(branch_ids)  # Use valid Branch IDs from branch_data
        inventory.loc[i, 'Stock_level'] = randint(0, 500)
        inventory.loc[i, 'Reorder_level'] = randint(10, 50)
        inventory.loc[i, 'Date'] = fake.date()
        
    # Cast specific columns to integers
    int_columns = ['Inventory_id', 'Item_id', 'Branch_id', 'Stock_level', 'Reorder_level']
    inventory[int_columns] = inventory[int_columns].astype(int)
    
    return inventory

# Generate 50 inventory records
inventory_data = inventory_data(50)
inventory_data.to_csv('inventory.csv', index=False)

# Preview the first few rows
inventory_data.head()

Unnamed: 0,Inventory_id,Item_id,Branch_id,Stock_level,Reorder_level,Date
0,419,36,1,14,36,1984-01-20
1,862,30,6,206,43,1991-11-22
2,706,30,6,269,19,1986-06-02
3,532,1,7,408,42,2006-08-04
4,647,23,7,312,12,1973-04-22
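
With independent random choices, the same item and branch can be paired more than once (rows 1 and 2 above both use item 30 at branch 6). Below is a minimal sketch that samples distinct (Item_id, Branch_id) pairs and numbers the rows sequentially. It assumes one inventory record per pair is what the model wants; the function name is hypothetical, and the `Date` column is left out for brevity.

```python
from itertools import product
from random import randint, sample

def inventory_rows(item_ids, branch_ids, n):
    """Return n inventory rows, each for a distinct (Item_id, Branch_id) pair."""
    all_pairs = list(product(item_ids, branch_ids))
    chosen = sample(all_pairs, n)  # without replacement; raises ValueError if n is too large
    return [
        {
            "Inventory_id": inventory_id,  # sequential, so ids are unique
            "Item_id": item_id,
            "Branch_id": branch_id,
            "Stock_level": randint(0, 500),
            "Reorder_level": randint(10, 50),
        }
        for inventory_id, (item_id, branch_id) in enumerate(chosen, start=1)
    ]

rows = inventory_rows(range(1, 51), range(1, 8), 50)
print(rows[0])
```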


##### For the Payment Method Table

In [16]:
def payment_method_data():
    payment_method = pd.DataFrame()

    # Define payment methods and their corresponding providers
    payment_methods = [
        {'Name': 'Cash', 'Provider': 'N/A'},  # Cash has no provider
        {'Name': 'Debit Card', 'Provider': 'Nomba POS'},  # Debit card uses Nomba POS
        {'Name': 'Online Payment', 'Provider': 'Nomba Web Checkout'},
        {'Name': 'Online Payment', 'Provider': 'Paystack'},
        {'Name': 'Online Payment', 'Provider': 'Interswitch'}
    ]

    # Iterate through each method to populate the table
    for i, method in enumerate(payment_methods):
        payment_method.loc[i, 'Payment_method_id'] = i + 1  # Unique Payment Method ID
        payment_method.loc[i, 'Name'] = method['Name']  # Payment method name
        payment_method.loc[i, 'Provider'] = method['Provider']  # Provider name
        
    # Cast Payment_method_id to integer
    payment_method['Payment_method_id'] = payment_method['Payment_method_id'].astype(int)


    return payment_method

# Generate the payment methods data
payment_method_data = payment_method_data()
payment_method_data.to_csv('payment_method.csv', index=False)

# Preview the data
payment_method_data.head()

Unnamed: 0,Payment_method_id,Name,Provider
0,1,Cash,
1,2,Debit Card,Nomba POS
2,3,Online Payment,Nomba Web Checkout
3,4,Online Payment,Paystack
4,5,Online Payment,Interswitch


##### For the Promotion Table

In [18]:
fake = Faker()

def promotion_data():
    promotion = pd.DataFrame()

    # Define unique promotion names
    promotion_names = [
        "Happy Hour",
        "Weekend Special",
        "Loyalty Reward",
        "Family Feast",
        "Seasonal Discount"
    ]

    # Create 5 promotions with unique names
    for i in range(len(promotion_names)):
        promotion.loc[i, 'Promotion_id'] = i + 1  # Unique Promotion ID
        promotion.loc[i, 'Name'] = promotion_names[i]  # Unique promotion name
        promotion.loc[i, 'Discount_amount'] = randint(5, 50)  # Random discount percentage
        promotion.loc[i, 'Validity_period'] = fake.date_this_year()  # Validity period
        
    # Cast Promotion_id to integer
    promotion['Promotion_id'] = promotion['Promotion_id'].astype(int)

    return promotion

# Generate the promotion data
promotion_data = promotion_data()
promotion_data.to_csv('promotion.csv', index=False)

# Preview the data
promotion_data.head()

Unnamed: 0,Promotion_id,Name,Discount_amount,Validity_period
0,1,Happy Hour,19.0,2024-01-18
1,2,Weekend Special,30.0,2024-05-12
2,3,Loyalty Reward,45.0,2024-06-17
3,4,Family Feast,25.0,2024-09-02
4,5,Seasonal Discount,44.0,2024-01-04
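
`Validity_period` holds a single date here, which leaves open when a promotion starts and when it ends. Below is a minimal sketch of generating a start and end date instead, using only the standard library. It assumes a promotion runs between 7 and 30 days within one calendar year; the function name and those bounds are hypothetical.

```python
from datetime import date, timedelta
from random import randint

def validity_window(year, min_days=7, max_days=30):
    """Return a (start, end) pair of dates that both fall inside `year`."""
    length = randint(min_days, max_days)
    first_day = date(year, 1, 1)
    # Latest start that still lets the promotion end by December 31
    last_start = date(year, 12, 31) - timedelta(days=length)
    offset = randint(0, (last_start - first_day).days)
    start = first_day + timedelta(days=offset)
    return start, start + timedelta(days=length)

start, end = validity_window(2024)
print(start, end)
```

The two dates could be stored as separate columns, for example `Valid_from` and `Valid_to`, if the model is extended that way.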
