# Simulating Sales Data

### Customer Information Data Set
- Customer ID
- First Name
- Last Name
- Age
- Gender (Female, Male, Non-Binary, Prefer Not to Say)
- Zip Code Location
- Customer Lifetime Value (CLV)
- Customer Segment (Through analysis: high spenders, occasional buyers, loyal customers)
- Feedback and Ratings

### Sales Transaction Data
- Transaction ID
- Transaction Date Time
- Type (Sale or Refund)
- Store ID
- Customer ID
- Purchase Total Amount
- Payment Method (Credit Card, Debit Card, Gift Card, Cash)
- Coupon/Promotion Code In Order

### Sales Transaction Details
- Transaction ID
- Product ID
- Quantity Purchased
- Coupon/Promotion Code Used

### Product Details
- Product ID
- Product Description
- Product Category (electronics, clothing, groceries)
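
The four tables outlined above share keys: Customer ID links customers to transactions, Transaction ID links transactions to their line items, and Product ID links line items to products. As a minimal sketch of how those keys might be joined in pandas, the frames below use made-up rows and shortened, underscore-style column names; only the key names come from the outline.

```python
import pandas as pd

# Hypothetical miniature tables; the values are invented for illustration.
customers = pd.DataFrame({"Customer_ID": [1, 2], "First_Name": ["Alex", "Casey"]})
transactions = pd.DataFrame({
    "Transaction_ID": [100, 101, 102],
    "Customer_ID": [1, 1, 2],
    "Type": ["Sale", "Sale", "Refund"],
})
details = pd.DataFrame({
    "Transaction_ID": [100, 100, 101, 102],
    "Product_ID": [10, 11, 10, 11],
    "Quantity": [1, 2, 1, 2],
})
products = pd.DataFrame({
    "Product_ID": [10, 11],
    "Category": ["electronics", "groceries"],
})

# Walk the keys: line item -> transaction -> customer, and line item -> product.
# validate="many_to_one" raises if a key is duplicated on the right-hand side.
line_items = (
    details
    .merge(transactions, on="Transaction_ID", how="left", validate="many_to_one")
    .merge(customers, on="Customer_ID", how="left", validate="many_to_one")
    .merge(products, on="Product_ID", how="left", validate="many_to_one")
)
print(line_items)
```

Starting from the line-item table keeps one row per product sold, so quantities can later be aggregated by customer or by category.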


In [34]:
# Imports

import re
import pandas as pd
import numpy as np

---
# Customer Information

#### Gender Neutral Names
https://www.emmasdiary.co.uk/baby-names/our-top-300-unisex-baby-names
#### Last Names
https://www.rong-chang.com/namesdict/100_last_names.htm#google_vignette

In [67]:
# Open the file in read mode
with open('Data/GenderNeutralNames.csv', 'r') as file:
    # Read each line and strip surrounding whitespace, including the newline
    excel_list = [line.strip() for line in file.readlines()]
    # Keep only numbered rows; drop the stray 'Â\xa0' and spaces,
    # then take the text after the last '.'
    names = [
        row.replace('Â\xa0', '').replace(' ', '').split('.')[-1]
        for row in excel_list
        if any(char.isdigit() for char in row)
    ]

names

['Addison',
 'Adrian',
 'Aiden',
 'Ainsley',
 'Alex',
 'Alfie',
 'Ali',
 'Amory',
 'Andie',
 'Andy',
 'Angel',
 'Archer',
 'Arden',
 'Ari',
 'Ariel',
 'Armani',
 'Arya',
 'Ash',
 'Ashley',
 'Ashton',
 'Aspen',
 'Athena',
 'Aubrey',
 'Auden',
 'August',
 'Avery',
 'Avis',
 'Bailey',
 'Baker',
 'Bay',
 'Bellamy',
 'Bergen',
 'Bevan',
 'Billie',
 'Billy',
 'Blaine',
 'Blair',
 'Blake',
 'Blue',
 'Bobby',
 'Bowie',
 'Brady',
 'Brennan',
 'Brent',
 'Brett',
 'Briar',
 'Brighton',
 'Britton',
 'Brooke',
 'Brooklyn',
 'Brooks',
 'Caelan',
 'Cameron',
 'Campbell',
 'Carey',
 'Carmel',
 'Carmen',
 'Carroll',
 'Carson',
 'Carter',
 'Casey',
 'Cassidy',
 'Chance',
 'Channing',
 'Charley',
 'Charlie',
 'Chris',
 'Clay',
 'Clayton',
 'Cody',
 'Cole',
 'Corey',
 'Dakota',
 'Dale',
 'Dallas',
 'Dana',
 'Dane',
 'Darby',
 'Daryl',
 'Dawson',
 'Delta',
 'Denver',
 'Devin',
 'Dorian',
 'Drew',
 'Dylan',
 'Easton',
 'Eddie',
 'Eden',
 'Elliott',
 'Ellis',
 'Ellison',
 'Ember',
 'Emerson',
 'Emery',
 'Emo
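
The `'Â\xa0'` sequence stripped in the cell above is most likely a UTF-8 non-breaking space that was decoded with a single-byte encoding such as Latin-1 or Windows-1252. The sketch below reproduces that effect on an in-memory string. The sample row is hypothetical, and whether the CSV was really saved as UTF-8 is an assumption; if it was, passing `encoding='utf-8'` to `open()` might avoid the garbled character in the first place.

```python
# A UTF-8 non-breaking space (U+00A0) is stored as the two bytes C2 A0.
nbsp_bytes = "\u00a0".encode("utf-8")

# Decoding those bytes as Latin-1 yields two characters: 'Â' followed by '\xa0'.
garbled = nbsp_bytes.decode("latin-1")
print(repr(garbled))

# Decoding them as UTF-8 gives back a single non-breaking space.
clean = nbsp_bytes.decode("utf-8")
print(repr(clean))

# Hypothetical raw row, assumed to be a rank, a dot, a non-breaking space, and a name.
raw_row = "1.\u00a0Addison"
name = raw_row.replace("\u00a0", "").replace(" ", "").split(".")[-1]
print(name)
```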

In [68]:
# Open the file in read mode
with open('Data/LastNames.csv', 'r') as file:
    # Read each line and strip surrounding whitespace, including the newline
    excel_list = [line.strip() for line in file.readlines()]

# Process names
last_names = []
for row in excel_list:
    # Remove unwanted characters ('Â\xa0', spaces, and doubled commas)
    cleaned_row = row.replace('Â\xa0', '').replace(' ', '').replace(',,', '')

    # Keep only rows that contain a digit, then extract the text after the last digit
    if any(char.isdigit() for char in cleaned_row):
        last_digit_idx = max(i for i, char in enumerate(cleaned_row) if char.isdigit())
        extracted_name = cleaned_row[last_digit_idx + 1:]
        # Add the name to the last-name list
        last_names.append(extracted_name)

print(last_names)

['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor', 'Anderson', 'Thomas', 'Jackson', 'White', 'Harris', 'Campbell', 'Parker', 'Evans', 'Edwards', 'Collins', 'Stewart', 'Sanchez', 'Morris', 'Rogers', 'Reed', 'Cook', 'Morgan', 'Bell', 'Murphy', 'Bailey', 'Rivera', 'Cooper', 'Richardson', 'Cox', 'Martin', 'Thompson', 'Garcia', 'Martinez', 'Robinson', 'Clark', 'Rodriguez', 'Lewis', 'Lee', 'Walker', 'Hall', 'Allen', 'Young', 'Hernandez', 'King', 'Howard', 'Ward', 'Torres', 'Peterson', 'Gray', 'Ramirez', 'James', 'Watson', 'Brooks', 'Kelly', 'Sanders', 'Price', 'Bennett', 'Wood', 'Barnes', 'Ross', 'Henderson', 'Coleman', 'Jenkins', 'Wright', 'Lopez', 'Hill', 'Scott', 'Green', 'Adams', 'Baker', 'Gonzalez', 'Nelson', 'Carter', 'Mitchell', 'Perez', 'Roberts', 'Turner', 'Phillips', 'Perry', 'Powell', 'Long', 'Patterson', 'Hughes', 'Flores', 'Washington', 'Butler', 'Simmons', 'Foster', 'Gonzales', 'Bryant', 'Alexander', 'Russell', 'Griffin', 'Diaz', 
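
The "text after the last digit" step can also be written with the `re` module that the notebook already imports. This is only an alternative sketch, not the notebook's method; the helper name `text_after_last_digit` and the sample rows are hypothetical.

```python
import re

def text_after_last_digit(row):
    """Return the text following the last digit in row, or None if row has no digit."""
    # \d matches a digit; (\D*)$ captures the run of non-digits up to the end of the
    # string, so the only digit that can satisfy the pattern is the last one.
    match = re.search(r"\d(\D*)$", row)
    return match.group(1) if match else None

# Hypothetical cleaned rows, shaped like a rank followed directly by a surname.
samples = ["1Smith", "10Johnson", "NoDigitsHere"]
print([text_after_last_digit(s) for s in samples])
```

Rows without any digit return `None`, which mirrors the `if any(char.isdigit() ...)` guard in the loop above.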

In [69]:
zips = pd.read_csv('Data/uszips.csv')
# Exclude Puerto Rico and the Virgin Islands; .copy() gives an independent DataFrame,
# so the column assignment below does not trigger pandas' SettingWithCopyWarning
continental_zips_df = zips[~zips['state_name'].isin(['Puerto Rico', 'Virgin Islands'])].copy()
# Zero-pad ZIP codes to five characters (the column keeps its original name, zip_6)
continental_zips_df['zip_6'] = continental_zips_df['zip'].astype(str).str.zfill(5)
zip_list = continental_zips_df['zip_6'].tolist()
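
ZIP codes are identifiers rather than numbers, so one option is to read them as strings from the start and pad only where needed. The sketch below assumes a file with `zip` and `state_name` columns, as in the cell above, but uses a small in-memory stand-in with made-up rows instead of `Data/uszips.csv`. The column name `zip_5` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for the ZIP file; rows are for illustration only.
csv_text = """zip,state_name
00601,Puerto Rico
07718,New Jersey
6850,Connecticut
39355,Mississippi
"""

# dtype={"zip": str} stops pandas from parsing the column as integers,
# which would silently drop leading zeros.
zips = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

# Filter first, then take an explicit copy before adding a column.
mask = ~zips["state_name"].isin(["Puerto Rico", "Virgin Islands"])
kept = zips.loc[mask].copy()

# zfill pads any value that was stored without its leading zero.
kept["zip_5"] = kept["zip"].str.zfill(5)
zip_list = kept["zip_5"].tolist()
print(zip_list)
```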


In [77]:
# set a random state for consistent results
np.random.seed(42)

# Assuming 200 unique customers

# Number of records in the dataset
num_records = 1000
# Number of unique customers
num_cust = 200


# Generate data for customer information
customer_data = {
    'Customer_ID': np.arange(1,num_cust+1), # 200 customers
    'First_Name': np.random.choice(names,size=num_cust,replace=True),
    'Last_Name': np.random.choice(last_names,size=num_cust,replace=True),
    'Age': np.random.randint(18,85, size = num_cust),
    'Gender': np.random.choice(['Male','Female','Non-Binary','Prefer Not To Say'],
                               size = num_cust,
                               replace = True,
                               p= [0.45,0.45,0.05,0.05]),
    'Location': np.random.choice(zip_list, size=num_cust, replace=True)    
}

# generate sales data
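
As a quick sanity check on the sampling used for the customer fields, the sketch below draws the same kinds of values with NumPy's `Generator` API (`np.random.default_rng`) rather than the global `np.random.seed`. This is a swapped-in alternative, not what the notebook does, and it produces different draws for the same seed. The variable names are hypothetical.

```python
import numpy as np

# A local generator keeps the random state explicit; seed 42 mirrors the notebook.
rng = np.random.default_rng(42)

genders = ["Male", "Female", "Non-Binary", "Prefer Not To Say"]
probs = [0.45, 0.45, 0.05, 0.05]

# The probabilities passed to choice() must sum to 1.
assert abs(sum(probs) - 1.0) < 1e-9

sample = rng.choice(genders, size=200, p=probs)

# integers() excludes the upper bound by default, so ages fall in 18..84,
# the same range as np.random.randint(18, 85).
ages = rng.integers(18, 85, size=200)

# Compare observed category counts against the requested proportions.
values, counts = np.unique(sample, return_counts=True)
print(dict(zip(values, counts)))
print(ages.min(), ages.max())
```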

In [78]:
customer_data_df = pd.DataFrame(customer_data)
customer_data_df.head(3)

| | Customer_ID | First_Name | Last_Name | Age | Gender | Location |
|---|---|---|---|---|---|---|
| 0 | 1 | Flynn | Kelly | 23 | Male | 7718 |
| 1 | 2 | Terry | Thompson | 64 | Male | 39355 |
| 2 | 3 | Gene | Edwards | 72 | Female | 6850 |
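
The notebook stops before the sales data is generated. One way the Sales Transaction Data table from the outline might be simulated is sketched below. Only `num_records`, `num_cust`, the column list, and the payment methods come from the notebook. The date range, refund rate, store count, and amount range are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

num_records = 1000  # from the notebook
num_cust = 200      # from the notebook

# Assumed: timestamps spread uniformly, to the minute, over calendar year 2023.
minutes_in_year = 365 * 24 * 60
offsets = pd.to_timedelta(rng.integers(0, minutes_in_year, size=num_records), unit="min")

transactions = pd.DataFrame({
    "Transaction_ID": np.arange(1, num_records + 1),
    "Transaction_DateTime": pd.Timestamp("2023-01-01") + offsets,
    # Assumed: 5% of records are refunds.
    "Type": rng.choice(["Sale", "Refund"], size=num_records, p=[0.95, 0.05]),
    # Assumed: 10 stores, numbered 1..10.
    "Store_ID": rng.integers(1, 11, size=num_records),
    # Every transaction points at one of the simulated customers.
    "Customer_ID": rng.integers(1, num_cust + 1, size=num_records),
    # Assumed: totals uniform between 5 and 500, rounded to cents.
    "Purchase_Total": np.round(rng.uniform(5.0, 500.0, size=num_records), 2),
    # Payment methods as listed in the outline, assumed equally likely.
    "Payment_Method": rng.choice(
        ["Credit Card", "Debit Card", "Gift Card", "Cash"], size=num_records
    ),
})

print(transactions.head(3))
```

Drawing `Customer_ID` from the range 1 to `num_cust` keeps every transaction joinable to the customer table on that key.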
