<a href="https://colab.research.google.com/github/BenGCollier/CIDM-6356/blob/main/How_to_Generate_Simulated_Data_with_Faker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>How to Generate Simulated Data with Faker</h1>
by Dr. Sean Humpherys<br>
Faker helps data scientists and software engineers create simulated data and simulated databases. For example, use Faker if you need to create simulated customers and simulated orders. Run the code cells below to explore how to use Faker. Run them multiple times to see the random nature of the output, and experiment with the arguments to understand how things work. Any changes you make to the code apply only to your session and do not affect Dr. Humpherys' original code. Copy this notebook if you want to save your changes. <br>
 <a href="https://faker.readthedocs.io/en/master/">Faker documentation (useful, but hard to understand)</a><br>
<a href="https://faker.readthedocs.io/en/master/locales/en_US.html">Faker index of methods (easier to understand)</a><br>


### Video Lecture:
<a href="https://www.screencast.com/t/Ebwun385g"  target="_blank">Part 1.</a> Dr. Humpherys explains the Faker package, how to use it, how to add fake data to a pandas DataFrame, and how to export the DataFrame to a CSV file (25 mins).<br>
<a href="https://www.screencast.com/t/C2EIHUs7pD" target="_blank">Part 2.</a> How to create random numbers and related numeric data (11 mins).<br>
<a href="https://www.screencast.com/t/gbeDzWrO1fi6" target="_blank">Part 3.</a> How to create random dates, text, and adjust the probability of the data (15 mins). <br>


In [None]:
#Install the Faker package. The ! means run this command in the terminal.
!pip install faker

In [None]:
#Import various packages
from faker import Faker
from datetime import date
import pandas as pd

#create a faker object
fake = Faker()

In [None]:
#Run this code cell several times to see the random selections each time.
fake.name_female()

In [None]:
# Create 20 names and append to a list
customer_names = [] #empty list of customer names
for _ in range(20):
  customer_names.append(fake.name())

customer_names

Note. You can change the localization if you want names from different countries. See the documentation for localized versions of Faker.

In [None]:
#Create an entire profile, but the data may be too real. Consider how to anonymize.
fake.profile()

In [None]:
'''Example of a more anonymized profile with random word
instead of a first name, excludes social security number,
and adds a safe email that is not real.
'''
customer = {
    'first_name': fake.word().capitalize(), #random word. Use .capitalize() to capitalize the first letter.
    'last_name': fake.last_name(),
    'job': fake.job(),
    'company': fake.company(),
    'cellphone': fake.phone_number(),
    'residence': fake.address(),
    'email': fake.safe_email(),  #safely generates only @example domains.
    'customer_id': fake.ean8() #ean is for barcode number but handy for long random integers.
}
customer

A single dictionary like this cannot be passed directly to a pandas DataFrame; first collect the dictionary (or dictionaries) in a list.<br>
Below is an example of creating a list of dictionaries that can be added to a pandas DataFrame. An alternative is to convert each dictionary to a pandas Series and then add the Series to a DataFrame, but the list method is easier.
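
To make the two routes concrete, here is a minimal sketch that uses two hand-written dictionaries (illustrative values, not Faker output). The variable names are assumptions made for this example.

```python
import pandas as pd

# Two hand-written customer dictionaries (illustrative values only).
customer_a = {'first_name': 'Maple', 'last_name': 'Nguyen', 'job': 'Chemist'}
customer_b = {'first_name': 'River', 'last_name': 'Okafor', 'job': 'Pilot'}

# Route 1: a list of dictionaries. Each dictionary becomes one row.
df_from_list = pd.DataFrame([customer_a, customer_b])

# Route 2: convert each dictionary to a Series first, then build the DataFrame.
df_from_series = pd.DataFrame([pd.Series(customer_a), pd.Series(customer_b)])

print(df_from_list)
print(df_from_series)
```

Both routes produce two rows with the same three columns, which is why the shorter list route is usually enough.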

In [None]:
# Define a function for our custom, customer profile generator.
def create_customer_profile():
  '''
  Create a custom, customer profile.
  Returns a dictionary object.
  '''
  customer = {
    'first_name': fake.word().capitalize(), #random word. Use .capitalize() to capitalize the first letter.
    'last_name': fake.last_name(),
    'job': fake.job(),
    'company': fake.company(),
    'cell_phone': fake.phone_number(),
    'residence': fake.address(),
    'email': fake.safe_email(),  #safely generates only @example domains.
    'customer_id': fake.ean8() #ean is a barcode number but handy for long random integers
    }
  return customer


#Add ten customers to a list
COUNT = 10
customers = [] #create empty list
for _ in range(COUNT):
  customers.append(create_customer_profile()) #calls create_customer_profile() and appends the result to customers

customers

In [None]:
# Add the list of customers to a pandas DataFrame
customers_df = pd.DataFrame(customers)
customers_df

In [None]:
# Save the dataframe as a csv file
customers_df.to_csv('fake_customers.csv')

This fake_customers.csv file is saved to the Colab file system. You must manually download the file to your hard drive BEFORE terminating your Colab session or the file will be deleted automatically. Click the folder icon (far left), hover over fake_customers.csv, click the three dots &vellip; , then click 'download'.
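
One detail worth knowing before you download the file: by default, `to_csv()` also writes the DataFrame's row index as an unnamed first column. The sketch below uses a small hand-written DataFrame (illustrative values) and in-memory buffers to compare the default with `index=False`.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({
    'first_name': ['Maple', 'River'],
    'last_name': ['Nguyen', 'Okafor'],
})

# Write to in-memory buffers instead of files so the sketch leaves nothing behind.
with_index = io.StringIO()
demo_df.to_csv(with_index)

without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)     # header row starts with an empty column name
print(header_without_index)  # header row holds only the real column names
```

If you do not want the extra column in your file, pass `index=False` when saving.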

### Adding more rows to an existing DataFrame or concatenating two DataFrames

In [None]:
#Add five more customers to a new list
COUNT = 5 #How many customers do you want?
more_customers = [] #create empty list
for _ in range(COUNT):
  more_customers.append(create_customer_profile()) #call create_customer_profile() and append the result to more_customers

more_customers_df = pd.DataFrame(more_customers)
dfs_to_combine = [customers_df, more_customers_df]
customers_df = pd.concat(dfs_to_combine, ignore_index=True)

In [None]:
customers_df.tail()  #see the newly added customers

In [None]:
# Sometimes you mess up your dataframe and need to delete it. Here's how.
customers_df = pd.DataFrame() #create an empty dataframe.
customers_df

# Alternative Method for Adding Lists to a DataFrame
This example creates several lists of random data related to logins. The first item in each list corresponds to the first item in the other lists, the second item corresponds to the second item in the other lists, and so on. Together, the lists are passed to a pandas DataFrame, which converts each list into a column; each item in a list becomes a row in the DataFrame.

In [None]:
COUNT = 100 #How many rows do you want?
user_names = [] #empty list
passwords = []
last_logins = []

#Populate the lists with random selections
for _ in range(COUNT):
  user_names.append(fake.user_name()) #randomly generates a username
  passwords.append(fake.password()) #randomly generates a password
  last_logins.append(fake.date_this_month()) #randomly selects a date

# Notice the { } must be included because the lists are first collected in a dictionary,
# which is then passed to pd.DataFrame().
user_log_df = pd.DataFrame({'UserName': user_names, 'Passwords': passwords, 'LastLogin': last_logins})
user_log_df

## How to create random numbers, etc.

In [None]:
fake.pyfloat()

In [None]:
fake.pyint(50, 200)  #generate random integer between two values

In [None]:
fake.pybool()  #create boolean values.

In [None]:
# Return True or False with an 85% probability of True
# Useful for creating unbalanced data
for _ in range(100):
  print(fake.pybool(truth_probability= 85))

In [None]:
#Use probability in an if statement
#85 percent chance of generating a female name
people = []
if fake.pybool(truth_probability= 85):
  people.append(fake.name_female())
else:
  people.append(fake.name_male())

people

In [None]:
# This is how Dr. Humpherys created a dataset of fake product returns where
# some returns did not have a receipt.
has_receipt = []
for _ in range(20):
  has_receipt.append(fake.pybool(truth_probability= 85))

has_receipt

In [None]:
# Randomly generates a price tag
# Problem: There is no way to set minimum and maximum values, so prices can vary by $50,000 or more,
# and the output is a string that includes a dollar sign and commas.
fake.pricetag()

In [None]:
#Solution is to use random as follows. Code generated by Gemini from Dr. Humpherys' prompts.
import random

def generate_random_price(min_price, max_price):
  """Generates a random price between min_price and max_price, rounded to two decimal places.

  Args:
    min_price: The minimum price (inclusive).
    max_price: The maximum price (may or may not be included, depending on floating-point rounding).

  Returns:
    A random price as a float rounded to two decimal places.
  """
  price = random.uniform(min_price, max_price)
  return round(price, 2)

# Example usage
generate_random_price(1.00, 400.00)


In [None]:
# Fake credit card and expiration date in a tuple.
fake.credit_card_number(), fake.credit_card_expire()

In [None]:
fake.upc_a() #fake barcode number

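## How to create random dates
Depending on the method, Faker returns a string (`fake.date()`), a date object (`fake.date_between()`), or a datetime object (`fake.date_time()`). Date and datetime objects are fine for use in a pandas DataFrame or for exporting to a CSV file. You do not need to format the date unless you want to.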
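
If you do want to format a date object yourself, Python's standard `datetime` module can do it. The sketch below uses a hand-written date (an illustrative value, not Faker output) to show converting a date object to a string and back.

```python
from datetime import date, datetime

sample_date = date(2023, 4, 1)  # hand-written example value

us_style = sample_date.strftime('%m-%d-%Y')  # date object -> formatted string
iso_style = sample_date.isoformat()          # date object -> ISO 8601 string

# Going the other way: parse a formatted string back into a date object.
parsed = datetime.strptime('04-01-2023', '%m-%d-%Y').date()

print(us_style)               # 04-01-2023
print(iso_style)              # 2023-04-01
print(parsed == sample_date)  # True
```

The same `strftime()` call works on date objects returned by Faker.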

In [None]:
fake.date() #Returns a string object

In [None]:
fake.date('%m-%d-%Y') #Format the string date as desired

In [None]:
fake.date('%a %D')
#Other string formats are available. See https://www.w3schools.com/python/python_datetime.asp

In [None]:
#Get a date object. start_date defaults to 30 years ago and end_date defaults to today.
fake.date_between()

In [None]:
# '-2y' calculates two years earlier than today
# '-1y' calculates one year earlier than today
# You can change the integer for a different year range
fake.date_between(start_date='-2y', end_date='-1y')

In [None]:
#How to specify a date range

#The "Incorrect" code will not work because start_date and end_date
#do not accept a date string like '04/01/2023'. Pass a date or datetime object
#(or a relative string such as '-2y') instead.

#Incorrect
#fake.date_between(start_date='04/01/2023', end_date='12/31/2023')

#Correct
import datetime as dt
start_date = dt.datetime(2020, 4, 1)
end_date = dt.datetime(2023, 12, 31)
fake.date_between(start_date, end_date)

In [None]:
#Faker's date_time() methods work the same way as the date() methods
# and add a time portion to the date.
# Note: date_time() returns a datetime object, whereas date() returns a string.
fake.date_time()

In [None]:
fake.date_time_between()

In [None]:
# Generates a date of birth with a minimum age of 0 to maximum age of 115.
fake.date_of_birth()

## How to Generate Random Text

In [None]:
fake.word()

In [None]:
fake.paragraph() #generate a random paragraph

In [None]:
fake.paragraph(5) #random paragraph with approximately 5 sentences

### How to customize the words list
You can provide your own sets of words if you don't want to use the default lorem ipsum one. This can be helpful for randomly selecting items you put in a list.

In [None]:
#Randomly select sugary treats
treats = [
'danish','cheesecake','donut',
'Lollipop','wafer','Gummies',
'licorice','Jelly beans',
'pie','candy bar','Icecream', 'fudge sundae' ]

fake.word(ext_word_list = treats)

In [None]:
#Randomly select sugary treats to create a sentence
treats = [
'danish','cheesecake','donut',
'Lollipop','wafer','Gummies',
'licorice','Jelly beans',
'pie','bar','Icecream', 'fudge sundae' ]

fake.sentence(ext_word_list = treats)

# Random Choices and Random Probabilities

Consider learning more about Python's `random` module and its functions. It is useful when you need a random selection from a list you specify.

random.choice() https://www.w3schools.com/python/ref_random_choice.asp   <br>
random.choices() with weighted probabilities https://www.geeksforgeeks.org/how-to-get-weighted-random-choice-in-python/  <br>


In [None]:
#Create a list of stuff and random.choice() will randomly select one of the items
import random
healthcare_jobs = ['dr', 'LPN', 'nurse', 'tech']
random.choice(healthcare_jobs)

In [None]:
#random.choices() will create a list of as many choices as you specify
#notice the s in random.choices()
#argument k is the number of items you wish to randomly return
import random
healthcare_jobs = ['dr', 'LPN', 'nurse', 'tech']
random.choices(healthcare_jobs, k=20) #20 items

You can change the probability of selecting an item from the list by specifying a weight for each item.
Each item in your list needs a corresponding weight value, as in this example.

```
healthcare_jobs = ["dr", 'LPN', 'nurse', 'tech']
weights = [1, 1, 2, 5]
```
The integers in the weights list can be thought of as a ratio (not exactly, but it is a helpful way to think about it), e.g. "Randomly select a healthcare job where the ratio is 1 doctor to 1 LPN to 2 nurses to 5 techs." You will not get an exact ratio, but nurses will be more likely than doctors and LPNs, and techs will be more likely than any of the others.
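
One way to see how closely the draws follow the weights is to count them. The sketch below is an illustration only: the seed value, the sample size of 10,000, and the variable names are arbitrary choices. It compares each job's expected share (its weight divided by the sum of the weights) with the share actually observed.

```python
import random
from collections import Counter

healthcare_jobs = ['dr', 'LPN', 'nurse', 'tech']
weights = [1, 1, 2, 5]

random.seed(42)  # arbitrary seed so the sketch is repeatable
picks = random.choices(healthcare_jobs, weights, k=10_000)
counts = Counter(picks)

total_weight = sum(weights)
for job, weight in zip(healthcare_jobs, weights):
    expected = weight / total_weight      # share implied by the weights
    observed = counts[job] / len(picks)   # share actually drawn
    print(f'{job:>5}: expected {expected:.1%}, observed {observed:.1%}')
```

With a large sample the observed shares land close to the expected ones; with a small `k` they can drift noticeably.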

In [None]:
healthcare_jobs = ["dr", 'LPN', 'nurse', 'tech']
weights = [1, 1, 2, 5]  #ratio of how many of each to randomly select
random.choices(healthcare_jobs, weights, k=200)

In [None]:
# Alternative example using different weights
healthcare_jobs = ["dr", 'LPN', 'nurse', 'tech']
weights = [1, 3, 6, 11]  #ratio of how many of each to randomly select
random.choices(healthcare_jobs, weights, k=20)

If you need data that is random but repeatable (i.e., the same each time you run your code), consider learning about the `fake.seed_instance()` method at https://faker.readthedocs.io/en/master/index.html#seeding-the-generator <br>
This feature is useful if you need to run the same test multiple times or share your code with someone else who is attempting to replicate your output.