# Simulating Foot-Traffic Data With Faker

Faker is a Python package that allows you to generate fake data such as names, addresses, and phone numbers. It can be useful for generating test data for applications, populating databases with fake information, or anonymizing sensitive data. The package uses various localized data sources, such as lists of names and addresses specific to different countries, to generate the fake data. It also allows you to customize the generated data to a certain extent, for example, specifying the format of a phone number or the gender of a name.

## Importing the Faker package

In [6]:
from faker import Faker

fake = Faker()

Once the `fake` object has been initialized, we can use it to generate data values from dozens of categories, called "providers". You can find the full list of providers in the Faker [documentation](https://faker.readthedocs.io/en/master/providers.html).

In [2]:
# generate a fake name
print(fake.name())

# generate female and male first names
print([fake.first_name_female(), fake.first_name_male()])

# generate a random date
print(fake.date())

# generate a realistic birthdate
print(fake.date_of_birth(minimum_age=13, maximum_age=100))

# generate a fake address
print(fake.address())

# generate fake user profile data
print(fake.profile())

Sarah Hobbs
['Jennifer', 'Richard']
1970-09-03
1996-11-10
525 Burnett Forest
Lake Nicolemouth, IN 58979
{'job': 'Loss adjuster, chartered', 'company': 'Boyle PLC', 'ssn': '463-91-4584', 'residence': '8070 Sanders View Suite 799\nNorth Lindsayburgh, IN 18239', 'current_location': (Decimal('-22.0076515'), Decimal('164.111213')), 'blood_group': 'AB-', 'website': ['https://pope.com/', 'https://www.branch.com/', 'http://www.stanton-roberts.com/', 'https://boyd.com/'], 'username': 'kim10', 'name': 'Travis Spears', 'sex': 'M', 'address': '3590 Jason Ways\nPort Corey, CT 05227', 'mail': 'imoon@yahoo.com', 'birthdate': datetime.date(1920, 1, 31)}


## Using Faker to create the `stores` table

The `stores` table tracks the store locations throughout the country, including their coordinates, city, and state. We want this data to be as realistic as possible so that we can map the stores later in the workshop, so we will be using the `Nominatim` geocoder from the `geopy` library to...

Create the `generate_store()` function

In [3]:
fake.local_latlng()

('26.68451', '-80.66756', 'Belle Glade', 'US', 'America/New_York')
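
The tuple holds, in order, latitude, longitude, place name, country code, and timezone, and the two coordinates arrive as strings. A short sketch of unpacking it, using the sample value from the output above (the variable names are our own):

```python
# Sample value copied from the cell output above.
sample = ('26.68451', '-80.66756', 'Belle Glade', 'US', 'America/New_York')

latitude_text, longitude_text, place_name, country_code, timezone = sample

# Faker returns the coordinates as strings; cast them before doing arithmetic.
latitude = float(latitude_text)
longitude = float(longitude_text)

print(latitude, longitude, place_name, country_code, timezone)
```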

In [4]:
from geopy import Nominatim

locator = Nominatim(user_agent='myGeocoder')

def generate_store():
    
    coords = fake.local_latlng(country_code="US")
    location = locator.reverse(coords[:2]).raw
    
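    # Nominatim labels the locality "city", "town", or "county" depending on the place.
    try:
        city_town = location["address"]["city"]
    except KeyError:
        try:
            city_town = location["address"]["town"]
        except KeyError:
            city_town = location["address"]["county"]
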
    store = {
        "store_id": fake.pyint(),
        "opened_date": str(fake.date_this_century()),
        "latitude": coords[0],
        "longitude": coords[1],
        "store_address": " ".join([str(fake.pyint()), location["address"]["road"]]),
        "city": city_town,
        "state": location["address"]["state"]
    }
    
    return store

generate_store()

{'store_id': 1572,
 'opened_date': '2013-10-29',
 'latitude': '41.54566',
 'longitude': '-71.29144',
 'store_address': '297 West Main Road',
 'city': 'Middletown',
 'state': 'Rhode Island'}

Create the `generate_stores()` function

In [5]:
import pandas as pd

def generate_stores(num_stores):
    
    stores = [generate_store() for i in range(num_stores)]
    
    return pd.DataFrame(stores)

generate_stores(5)

Unnamed: 0,store_id,opened_date,latitude,longitude,store_address,city,state
0,4392,2012-04-20,42.52787,-70.92866,4376 Central Street,Peabody,Massachusetts
1,7340,2018-07-12,40.64621,-73.97069,9983 Albemarle Road,City of New York,New York
2,1760,2018-07-23,37.73604,-120.93549,7891 Santa Fe Street,Riverbank,California
3,9536,2011-08-10,32.52515,-93.75018,2363 River Parkway,Shreveport,Louisiana
4,6895,2007-08-27,39.72943,-104.83192,7185 Vaughn Street,Aurora,Colorado


Create a table of 50 stores

In [7]:
stores = generate_stores(50) #if you get a KeyError, run this cell again
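
Rerunning the cell by hand works, but the retry can also be automated. The sketch below is one possible approach: `retry_on_key_error` is a hypothetical helper, and `flaky_store` is a stand-in for `generate_store()` so the example runs without calling the geocoder.

```python
def retry_on_key_error(build_record, max_attempts=5):
    """Call build_record() until it returns, giving up after max_attempts KeyErrors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return build_record()
        except KeyError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error


# Stand-in for generate_store(): the first two calls lack a "road" key.
calls = {"count": 0}

def flaky_store():
    calls["count"] += 1
    address = {"state": "Rhode Island"}
    if calls["count"] >= 3:
        address["road"] = "West Main Road"
    return {"store_address": address["road"], "state": address["state"]}


store = retry_on_key_error(flaky_store)
print(store, "after", calls["count"], "calls")
```

In the notebook, the same helper could wrap the real function, for example `pd.DataFrame([retry_on_key_error(generate_store) for _ in range(50)])`.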

Map the newly generated store locations

In [8]:
import folium

m = folium.Map(location=[39.8283, -98.5795], zoom_start=4)

for _, row in stores.iterrows():
    # Coordinates are stored as strings, so cast them before plotting.
    folium.Marker(
        location=[float(row.latitude), float(row.longitude)],
        tooltip=f"{row.city}, {row.state}",
    ).add_to(m)

m

### Exercise: 

#### Part 1
Create the function `generate_customer()`, which returns a dictionary of customer data with the following attributes:

```
customer_id
customer_name
customer_birthday
customer_email
is_member
card_on_file
```

#### Part 2
Generate a CSV file containing 1500 customer records. Name the file `customers.csv`.


------
You can look through the Faker [documentation](https://faker.readthedocs.io/en/master/providers.html) to help you.


In [9]:
def generate_customer():
    
    customer = {
        "customer_id": fake.uuid4().split("-")[0],
        "customer_name": fake.name(),
        "customer_birthday": fake.date_of_birth(minimum_age=13, maximum_age=110),
        "customer_email": fake.email(),
        "is_member": fake.boolean(),
        "card_on_file": fake.credit_card_provider()
        
    }
    
    return customer

generate_customer()

{'customer_id': '8d555878',
 'customer_name': 'Kathy Swanson',
 'customer_birthday': datetime.date(1927, 1, 31),
 'customer_email': 'hardinlaura@liu.com',
 'is_member': True,
 'card_on_file': 'Diners Club / Carte Blanche'}

In [10]:
def generate_customers(num_customers):
    
    customers = [generate_customer() for i in range(num_customers)]
    
    return pd.DataFrame(customers)

generate_customers(5)

Unnamed: 0,customer_id,customer_name,customer_birthday,customer_email,is_member,card_on_file
0,0812cfe7,Aaron Porter,1973-10-01,nicolefitzgerald@taylor.info,False,VISA 16 digit
1,a4235fea,Erica Cox,1937-06-02,vincentfritz@yahoo.com,True,JCB 16 digit
2,7a12d4d8,Timothy Meyer,1945-04-20,brian49@garza-johnson.com,False,Discover
3,feea5f31,Justin Griffin,1912-06-27,desiree74@gmail.com,False,JCB 16 digit
4,83a63794,Dillon Martin,1927-09-20,kellydavid@doyle.com,False,VISA 16 digit


In [11]:
customers = generate_customers(500)
customers.describe()

Unnamed: 0,customer_id,customer_name,customer_birthday,customer_email,is_member,card_on_file
count,500,500,500,500,500,500
unique,500,500,496,500,2,10
top,7683165a,Carla Johnson,1919-03-02,elizabethali@haynes-stephens.com,True,JCB 16 digit
freq,1,1,2,1,278,95


Create the `generate_visit()` function

In [7]:
import random

def generate_visit(store_df, customer_df, visit_date="01-01-2022"):
    
    visit = {
        "visit_id": str(fake.uuid4().split("-")[0]),
        "visit_date": visit_date,
        "store_id": store_df.sample().store_id.values[0],
        "customer_id": customer_df.sample().customer_id.values[0],
        "order_total": round(random.random() * random.choice([10, 100, 500, 1000]), 2),
        "payment_method": random.choice(["cash", "credit"]),
    }
    
    return visit

generate_visit(stores, customers)

{'visit_id': '8d81e02a',
 'visit_date': '01-01-2022',
 'store_id': 8860,
 'customer_id': 'f81be72b',
 'order_total': 1.65,
 'payment_method': 'credit'}
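
The `order_total` expression multiplies a uniform random number in [0, 1) by a randomly chosen ceiling, which produces many small orders and a few large ones. The sketch below samples that expression on its own to show the resulting distribution. The seed and sample size are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(0)  # arbitrary seed so the sample is repeatable

CEILINGS = [10, 100, 500, 1000]


def sample_order_total():
    # Same expression as in generate_visit().
    return round(random.random() * random.choice(CEILINGS), 2)


samples = [sample_order_total() for _ in range(100_000)]

# Each ceiling is picked with probability 1/4 and the uniform factor averages 0.5,
# so the theoretical mean is 0.5 * mean(CEILINGS) = 0.5 * 402.5 = 201.25.
expected_mean = 0.5 * mean(CEILINGS)
sample_mean = mean(samples)

# Theoretical share of orders under 10: (1 + 0.1 + 0.02 + 0.01) / 4 = 0.2825.
share_under_10 = sum(total < 10 for total in samples) / len(samples)

print(expected_mean, round(sample_mean, 2), round(share_under_10, 3))
```

If you want a different spending profile, changing the list of ceilings is the simplest lever.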

Create the `generate_visits()` function

In [8]:
def generate_visits(num_visits, store_df, customer_df, visit_date="01-01-2022"):
    
    def generate_visit(store_df, customer_df, visit_date=visit_date):
    
        visit = {
            "visit_id": str(fake.uuid4().split("-")[0]),
            "visit_date": visit_date,
            "store_id": store_df.sample().store_id.values[0],
            "customer_id": customer_df.sample().customer_id.values[0],
            "order_total": round(random.random() * random.choice([10, 100, 500, 1000]), 2),
            "payment_method": random.choice(["cash", "credit"]),
        }
        
        return visit
    
    visits = pd.DataFrame([generate_visit(store_df, customer_df) for _ in range(num_visits)])
    
    return visits

generate_visits(5, stores, customers, visit_date="01-01-2022")

Unnamed: 0,visit_id,visit_date,store_id,customer_id,order_total,payment_method
0,bfc79689,01-01-2022,6273,ba528f94,460.35,cash
1,bbae5c30,01-01-2022,9417,7f686f55,339.99,cash
2,f4ef09e1,01-01-2022,1492,f20e1ae0,433.88,cash
3,beb7b2b3,01-01-2022,530,9e0948a6,0.49,credit
4,f12bf017,01-01-2022,8860,b9ae5cb2,898.85,credit


## Use the functions to create `seed_data()`

In [15]:
from pathlib import Path

def seed_data(start_date, end_date, directory, num_stores, num_customers):
    
    Path(directory).mkdir(parents=True, exist_ok=True)
    
    stores = generate_stores(num_stores)
    stores.to_csv(f"{directory}/stores.csv", index=False)
    
    customers = generate_customers(num_customers)
    customers.to_csv(f"{directory}/customers.csv", index=False)
    
    visit_data = []
    
    for i in pd.date_range(start_date, end_date):
        visits = generate_visits(random.randrange(1, 10000), stores, customers, visit_date=i)
        visit_data.append(visits)
    
    pd.concat(visit_data).to_csv(f"{directory}/visits.csv", index=False)

seed_data("01-01-2022", "06-01-2022", "data/db", num_stores=50, num_customers=1500)
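
Before running this call, it can help to estimate how large `visits.csv` will be. The back-of-the-envelope sketch below assumes pandas reads the two date strings month-first (January 1 to June 1, 2022) and that `pd.date_range` includes both endpoints, which is its default.

```python
from datetime import date

start, end = date(2022, 1, 1), date(2022, 6, 1)

# Both endpoints are included, hence the + 1.
num_days = (end - start).days + 1

# random.randrange(1, 10000) draws uniformly from 1..9999, so the mean is 5000.
mean_visits_per_day = (1 + 9999) / 2

expected_rows = int(num_days * mean_visits_per_day)
print(num_days, expected_rows)  # prints: 152 760000
```

The actual row count will vary from run to run, since each day's total is random.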
    
    

# BREAK - BACK TO WORKSHOP GUIDE

## Set up database for generated data

Reference: pandas [`DataFrame.to_sql` documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html)

In [1]:
# only run this cell if you no longer have the stores, customers, and visits dataframes in your environment
import pandas as pd

customers = pd.read_csv("data/db/customers.csv")
stores = pd.read_csv("data/db/stores.csv")
visits = pd.read_csv("data/db/visits.csv")

In [2]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data.db', echo=False)

customers.to_sql("customers", con=engine, index=False)
stores.to_sql("stores", con=engine, index=False)
visits.to_sql("visits", con=engine, index=False)

engine.dispose()
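
`to_sql()` defaults to `if_exists="fail"`, so rerunning the cell above against an existing `data.db` raises a `ValueError`. Here is a small sketch of that behavior and of `if_exists="replace"`. The in-memory SQLite connection and the two-row stand-in for `stores` are assumptions for illustration, not workshop data.

```python
import sqlite3

import pandas as pd

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# Two-row stand-in for the real stores table.
stores_demo = pd.DataFrame({"store_id": [1, 2], "state": ["Ohio", "Utah"]})

# The first write creates the table.
stores_demo.to_sql("stores", con=conn, index=False)

# A second write with the default if_exists="fail" raises ValueError.
second_write_error = None
try:
    stores_demo.to_sql("stores", con=conn, index=False)
except ValueError as err:
    second_write_error = str(err)

# if_exists="replace" drops and recreates the table instead.
stores_demo.to_sql("stores", con=conn, index=False, if_exists="replace")

row_count = conn.execute("SELECT COUNT(*) FROM stores").fetchone()[0]
conn.close()

print(second_write_error)
print(row_count)
```

Use `"replace"` when you want a fresh table on every run and `"append"` when you want to keep adding rows, as the live-app cell later in this notebook does.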

In [3]:
from sqlalchemy import text

with engine.connect() as conn:
    res = conn.execute(text("SELECT * FROM customers")).fetchall()

pd.DataFrame(res)

Unnamed: 0,customer_id,customer_name,customer_birthday,customer_email,is_member,card_on_file
0,fff10ac3,Jennifer Gaines,1949-04-10,mcasey@gmail.com,1,VISA 16 digit
1,89194075,Sara Washington,2004-03-31,banderson@scott.com,1,Mastercard
2,5f94728b,Jennifer Wilson,1918-07-19,destiny30@mendoza.net,0,VISA 16 digit
3,72310ca4,Courtney Fuller,1912-10-27,mgriffith@clayton.net,1,JCB 16 digit
4,10121b9c,Michelle Weber,1982-11-25,bettyalexander@marshall.com,0,Diners Club / Carte Blanche
...,...,...,...,...,...,...
1495,639a407f,Dennis Washington DDS,1940-04-20,kevinrice@mcintosh.com,1,VISA 19 digit
1496,e993af0e,Thomas Fox,1944-02-21,trevor62@garcia.info,1,VISA 16 digit
1497,e6ea9a30,Angel Webster,1934-01-14,qjones@gmail.com,1,Mastercard
1498,ba28162a,Zoe Miller,1980-01-22,rebecca13@hotmail.com,1,Discover


# Back to Workshop Guide - Create Live App

## Send new data to database

In [9]:
import random
import time

import pandas as pd
from sqlalchemy import create_engine
# engine.dispose()

def generate_data(db_engine, start_date, end_date, time_delay=2):
    with db_engine.connect() as conn:
        customers = pd.read_sql("customers", conn)
        stores = pd.read_sql("stores", conn)
        
        for i in pd.date_range(start_date, end_date):
            visits = generate_visits(random.randrange(1, 10000), stores, customers, visit_date=i)
            visits.to_sql("visits", con=db_engine, if_exists='append', index=False)
            print(f"inserted {len(visits)} records from {str(i)}")
            print("---")
            time.sleep(time_delay)
        

engine = create_engine("sqlite:///data.db", echo=False)
generate_data(engine, "2022-06-01", "2022-12-31")

inserted 9884 records from 2022-06-01 00:00:00
---
inserted 7059 records from 2022-06-02 00:00:00
---
inserted 5351 records from 2022-06-03 00:00:00
---
inserted 2216 records from 2022-06-04 00:00:00
---
inserted 872 records from 2022-06-05 00:00:00
---
inserted 550 records from 2022-06-06 00:00:00
---
inserted 2105 records from 2022-06-07 00:00:00
---
inserted 2523 records from 2022-06-08 00:00:00
---
inserted 3124 records from 2022-06-09 00:00:00
---
inserted 756 records from 2022-06-10 00:00:00
---
inserted 9157 records from 2022-06-11 00:00:00
---
inserted 8535 records from 2022-06-12 00:00:00
---
inserted 4148 records from 2022-06-13 00:00:00
---
inserted 7212 records from 2022-06-14 00:00:00
---
inserted 6475 records from 2022-06-15 00:00:00
---
inserted 4068 records from 2022-06-16 00:00:00
---
inserted 3308 records from 2022-06-17 00:00:00
---
inserted 3526 records from 2022-06-18 00:00:00
---
inserted 2212 records from 2022-06-19 00:00:00
---
inserted 8028 records from 2022-06

KeyboardInterrupt: 

In [15]:
from sqlalchemy import text
# engine = create_engine("sqlite:///data.db", echo=False)
with engine.connect() as conn:
    test = conn.execute(text("SELECT * FROM visits")).fetchall()

len(test)

909744
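
Fetching every row just to count them pulls the whole table into memory. Letting the database do the counting is usually cheaper. The sketch below shows the idea with the standard library's `sqlite3` module and a three-row stand-in table (an assumption for illustration) rather than the workshop's `data.db`.

```python
import sqlite3

# In-memory database with a tiny stand-in for the visits table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_id TEXT, order_total REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("a1", 12.5), ("b2", 80.0), ("c3", 3.25)],
)

# COUNT(*) returns a single row holding the number of rows in the table.
(row_count,) = conn.execute("SELECT COUNT(*) FROM visits").fetchone()
conn.close()

print(row_count)  # prints: 3
```

With the SQLAlchemy engine used in this notebook, the equivalent would likely be `conn.execute(text("SELECT COUNT(*) FROM visits")).scalar()`.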