# Simulating Foot-Traffic Data With Faker

Faker is a Python package that allows you to generate fake data such as names, addresses, and phone numbers. It can be useful for generating test data for applications, populating databases with fake information, or anonymizing sensitive data. The package uses various localized data sources, such as lists of names and addresses specific to different countries, to generate the fake data. It also allows you to customize the generated data to a certain extent, for example, specifying the format of a phone number or the gender of a name.

## Importing the Faker package

In [1]:
from faker import Faker

fake = Faker()

Once the `fake` object has been initialized, we can use it to generate data values from dozens of categories, called "providers". You can find the full list of providers in the Faker [documentation](https://faker.readthedocs.io/en/master/providers.html).

In [2]:
# generate a fake name
print(fake.name())

# generate female and male first names
print([fake.first_name_female(), fake.first_name_male()])

# generate a random date
print(fake.date())

# generate a realistic birthdate
print(fake.date_of_birth(minimum_age=13, maximum_age=100))

# generate a fake address
print(fake.address())

# generate fake user profile data
print(fake.profile())

Michael Cummings
['Shannon', 'Jose']
1972-08-14
2010-01-18
123 Steven Mountains Suite 788
Willisbury, TX 08482
{'job': 'Multimedia programmer', 'company': 'Bennett, Newton and Smith', 'ssn': '159-31-6334', 'residence': '62797 Bass Glen\nNew Heidi, NJ 27273', 'current_location': (Decimal('18.380034'), Decimal('30.335851')), 'blood_group': 'O+', 'website': ['https://www.patel-becker.biz/', 'https://www.reyes.net/', 'https://www.mcclain-york.com/', 'http://www.gonzalez-daugherty.com/'], 'username': 'robertnewman', 'name': 'Denise George', 'sex': 'F', 'address': '108 Green Heights\nNorth Natalie, RI 93850', 'mail': 'pmiller@gmail.com', 'birthdate': datetime.date(1979, 10, 22)}


## Using Faker to create the `stores` table

The `stores` table tracks the store locations throughout the country, including each store's coordinates, city, and state. We want this data to be as realistic as possible so that we can map the stores later in the workshop, so we will be using the `Nominatim` geocoder from the `geopy` library to...

Create the `generate_store()` function

In [None]:
fake.local_latlng()

('41.48199', '-81.79819', 'Lakewood', 'US', 'America/New_York')

In [3]:
from geopy import Nominatim

locator = Nominatim(user_agent='myGeocoder')

def generate_store():
    # Pick a random real US location, then reverse-geocode it for address details.
    coords = fake.local_latlng(country_code="US")
    address = locator.reverse(coords[:2]).raw["address"]

    # Nominatim labels the locality as a city, town, or county depending on the place.
    city_town = address.get("city") or address.get("town") or address.get("county")

    store = {
        "store_id": fake.pyint(),
        "opened_date": str(fake.date_this_century()),
        "latitude": coords[0],
        "longitude": coords[1],
        # Not every result includes a road, so avoid a KeyError when it is missing.
        "store_address": " ".join([str(fake.pyint()), address.get("road", "")]).strip(),
        "city": city_town,
        "state": address.get("state"),
    }

    return store

generate_store()

{'store_id': 7452,
 'opened_date': '2005-06-13',
 'latitude': '40.72816',
 'longitude': '-74.07764',
 'store_address': '5186 Olean Avenue',
 'city': 'Jersey City',
 'state': 'New Jersey'}
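
The public Nominatim service asks clients to stay at or below one request per second, which matters once `generate_store()` is called in a loop. Below is a minimal sketch of a throttling wrapper built on the standard library. The names `throttle` and `fake_lookup` are hypothetical, and `fake_lookup` only stands in for `locator.reverse` so the example runs without network access.

```python
import time
from functools import wraps


def throttle(min_interval):
    """Return a decorator that spaces calls at least `min_interval` seconds apart."""
    def decorator(func):
        last_call = [None]  # time of the previous call; None before the first one

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = min_interval - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)

        return wrapper
    return decorator


# Hypothetical stand-in for locator.reverse (assumption: returns a dict with an "address" key).
@throttle(min_interval=0.2)
def fake_lookup(coords):
    return {"address": {"city": "Lakewood", "state": "Ohio"}}


start = time.monotonic()
results = [fake_lookup(("41.48199", "-81.79819")) for _ in range(3)]
elapsed = time.monotonic() - start
print(f"{len(results)} lookups took {elapsed:.2f}s")
```

With the real geocoder you would likely use `min_interval=1.0`. geopy also ships its own `RateLimiter` helper in `geopy.extra.rate_limiter`, which may be a better fit than a hand-rolled wrapper.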

Create the `generate_stores()` function

In [4]:
import pandas as pd

def generate_stores(num_stores):
    
    stores = [generate_store() for i in range(num_stores)]
    
    return pd.DataFrame(stores)

generate_stores(5)

Unnamed: 0,store_id,opened_date,latitude,longitude,store_address,city,state
0,3913,2013-02-09,47.4943,-122.24092,7239 Renton Avenue South,Seattle,Washington
1,3718,2005-11-16,33.76446,-117.79394,6731 Risa Place,Orange County,California
2,5590,2003-07-30,37.73604,-120.93549,4189 Santa Fe Street,Riverbank,California
3,5095,2007-11-15,37.95143,-91.77127,1465 Acorn Trail,Rolla,Missouri
4,5737,2001-10-26,35.61452,-88.81395,2383 East Main Street,Jackson,Tennessee
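
`fake.pyint()` draws each `store_id` independently, so two stores can occasionally end up with the same ID. One way to rule that out is to sample without replacement. This is a sketch using the standard library; `unique_store_ids` is a hypothetical helper, and the four-digit range is an assumption chosen to resemble the IDs shown above.

```python
import random


def unique_store_ids(num_stores, low=1000, high=9999, seed=None):
    """Draw `num_stores` distinct IDs from [low, high] without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), num_stores)


ids = unique_store_ids(50, seed=42)
print(len(ids), len(set(ids)))
```

The IDs could then be assigned to the `store_id` column after the DataFrame is built, instead of being generated inside `generate_store()`.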


Create a table of 50 stores

In [5]:
stores = generate_stores(50)

Map the newly generated store locations

In [6]:
import folium

m = folium.Map(location=[39.8283, -98.5795], zoom_start=4)

for _, store in stores.iterrows():
    # Coordinates are stored as strings, so convert them before plotting.
    folium.Marker(
        location=[float(store.latitude), float(store.longitude)],
        tooltip=f"{store.city}, {store.state}",
    ).add_to(m)

m

Exercise: create the function `generate_customer()` that will generate a dictionary of customer data with the following attributes:

```
customer_id
customer_name
customer_birthday
customer_email
is_member
card_on_file
```

You can look through the Faker [documentation](https://faker.readthedocs.io/en/master/providers.html) to help you.

Next, create the function `generate_customers()` that will generate a DataFrame with a given number of customers.

In [7]:
def generate_customer():
    
    customer = {
        "customer_id": fake.uuid4().split("-")[0],
        "customer_name": fake.name(),
        "customer_birthday": fake.date_of_birth(minimum_age=13, maximum_age=110),
        "customer_email": fake.email(),
        "is_member": fake.boolean(),
        "card_on_file": fake.credit_card_provider()
        
    }
    
    return customer

generate_customer()

{'customer_id': '04bb2245',
 'customer_name': 'Sara Olsen',
 'customer_birthday': datetime.date(1991, 5, 23),
 'customer_email': 'bradleymcdonald@gmail.com',
 'is_member': False,
 'card_on_file': 'JCB 16 digit'}

In [8]:
def generate_customers(num_customers):
    
    customers = [generate_customer() for i in range(num_customers)]
    
    return pd.DataFrame(customers)

generate_customers(5)

Unnamed: 0,customer_id,customer_name,customer_birthday,customer_email,is_member,card_on_file
0,f5cf1644,Julie Daniels,1950-10-20,millertodd@hayden.com,False,JCB 15 digit
1,4eec1f48,Ashley Ho,1925-11-25,shelly48@levine-west.com,False,JCB 16 digit
2,6ebe6864,Alexandra Johnson,1935-11-23,ghamilton@cook-howard.com,False,JCB 15 digit
3,9a8c53f9,Brandon Cruz,1948-04-20,gary48@gmail.com,False,JCB 15 digit
4,84213c3d,Sara Thompson,1981-04-07,destinynovak@gmail.com,True,Maestro


In [9]:
customers = generate_customers(500)
customers.describe()

Unnamed: 0,customer_id,customer_name,customer_birthday,customer_email,is_member,card_on_file
count,500,500,500,500,500,500
unique,500,498,496,500,2,10
top,d558efbc,Heather Smith,1986-01-08,stevenmora@hotmail.com,False,VISA 16 digit
freq,1,2,2,1,262,80
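
`customer_id` keeps only the first 8 hex characters of a UUID, which is 32 bits of randomness, so repeated IDs become possible as the table grows. The `unique` row above shows none occurred in this sample of 500. Below is a sketch that guarantees distinct IDs by collecting them in a set. It uses the standard library's `uuid` module in place of `fake.uuid4()`, and `short_ids` is a hypothetical helper.

```python
import uuid


def short_ids(n, length=8):
    """Generate `n` distinct hex IDs of `length` characters from random UUIDs."""
    seen = set()
    while len(seen) < n:
        # uuid4().hex is 32 hex characters; the set silently drops any repeat.
        seen.add(uuid.uuid4().hex[:length])
    return list(seen)


customer_ids = short_ids(1000)
print(customer_ids[:3])
```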


Create the `generate_visit()` function

In [10]:
import random

def generate_visit(store_df, customer_df, visit_date="01-01-2022"):
    
    visit = {
        "visit_id": str(fake.uuid4().split("-")[0]),
        "visit_date": visit_date,
        "store_id": store_df.sample().store_id.values[0],
        "customer_id": customer_df.sample().customer_id.values[0],
        "order_total": round(random.random() * random.choice([10, 100, 500, 1000]), 2),
        "payment_method": random.choice(["cash", "credit"]),
    }
    
    return visit

generate_visit(stores, customers)

{'visit_id': 'c14e3b00',
 'visit_date': '01-01-2022',
 'store_id': 2824,
 'customer_id': '16628533',
 'order_total': 435.65,
 'payment_method': 'cash'}
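
`order_total` multiplies a uniform draw in [0, 1) by a scale picked from four values, so small orders are more common than a single uniform range would give. Under that formula, the chance that an order is below 10 is (1 + 0.1 + 0.02 + 0.01) / 4 = 0.2825. Here is a quick seeded simulation to check the arithmetic. `simulate_totals` is a hypothetical helper, and the rounding step is skipped for simplicity.

```python
import random

SCALES = [10, 100, 500, 1000]


def simulate_totals(n, seed=0):
    """Draw `n` order totals using the same formula as generate_visit()."""
    rng = random.Random(seed)
    return [rng.random() * rng.choice(SCALES) for _ in range(n)]


# Exact probability: average over the scales of P(uniform * scale < 10).
exact_share = sum(min(1.0, 10 / s) for s in SCALES) / len(SCALES)

totals = simulate_totals(100_000)
observed_share = sum(t < 10 for t in totals) / len(totals)
print(f"exact={exact_share:.4f}")
```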

Create the `generate_visits()` function

In [11]:
def generate_visits(num_visits, store_df, customer_df, visit_date="01-01-2022"):
    # Reuse generate_visit() from the previous cell instead of redefining it here.
    visits = [
        generate_visit(store_df, customer_df, visit_date=visit_date)
        for _ in range(num_visits)
    ]

    return pd.DataFrame(visits)

generate_visits(5, stores, customers, visit_date="01-01-2022")

Unnamed: 0,visit_id,visit_date,store_id,customer_id,order_total,payment_method
0,131296d9,01-01-2022,8626,66331c57,2.24,cash
1,0f9eda11,01-01-2022,3766,6661e294,774.79,credit
2,b6c9cb60,01-01-2022,1063,5c1dd3d3,488.49,cash
3,02d14d1b,01-01-2022,4872,a1274469,81.14,cash
4,7e184197,01-01-2022,8769,5d7df2e6,509.61,cash
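
Calling `DataFrame.sample()` twice per visit gets slow when `num_visits` is large. A possible alternative is to pull the ID columns into plain lists once and draw all the IDs in one call with `random.choices`. In this sketch, `sample_visit_keys` is a hypothetical helper, and the short ID lists are stand-ins for `stores.store_id` and `customers.customer_id`.

```python
import random


def sample_visit_keys(num_visits, store_ids, customer_ids, seed=None):
    """Draw (store_id, customer_id) pairs with replacement, all at once."""
    rng = random.Random(seed)
    picked_stores = rng.choices(store_ids, k=num_visits)
    picked_customers = rng.choices(customer_ids, k=num_visits)
    return list(zip(picked_stores, picked_customers))


# Stand-in values; with the real tables these would come from the DataFrames.
store_ids = [3913, 3718, 5590]
customer_ids = ["f5cf1644", "4eec1f48", "6ebe6864", "9a8c53f9"]

pairs = sample_visit_keys(5, store_ids, customer_ids, seed=1)
print(pairs)
```

With the real tables, the lists could be built with `stores.store_id.tolist()` and `customers.customer_id.tolist()`.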


Use the functions above to create `seed_data()`

In [12]:
from pathlib import Path

def seed_data(start_date, end_date, directory, num_stores, num_customers):
    
    Path(directory).mkdir(parents=True, exist_ok=True)
    
    stores = generate_stores(num_stores)
    stores.to_csv(f"{directory}/stores.csv", index=False)
    
    customers = generate_customers(num_customers)
    customers.to_csv(f"{directory}/customers.csv", index=False)
    
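    visit_data = []
    
    # Generate a random number of visits for each day in the range.
    for day in pd.date_range(start_date, end_date):
        visits = generate_visits(random.randrange(1, 10000), stores, customers, visit_date=day)
        visit_data.append(visits)
    
    # Combine the daily tables and write them to a single CSV file.
    pd.concat(visit_data).to_csv(f"{directory}/visits.csv", index=False)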

seed_data("01-01-2020", "06-01-2020", "data/db", num_stores=50, num_customers=1500)
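
pandas reads `"01-01-2020"` and `"06-01-2020"` month-first, so this call covers January 1 through June 1, 2020. The standard-library sketch below counts the days in that range and writes them in the unambiguous ISO format. `daily_dates` is a hypothetical helper, not part of the workshop code.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield every date from `start` to `end`, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


days = [d.isoformat() for d in daily_dates(date(2020, 1, 1), date(2020, 6, 1))]
print(len(days), days[0], days[-1])
```

The range spans 153 days. With between 1 and 9,999 visits drawn per day, the resulting `visits.csv` can be expected to hold several hundred thousand rows, so passing ISO strings such as `"2020-01-01"` and a smaller upper bound may be worth considering for quick test runs.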
    
    