# Creating A Dataset

## `Ice Cream Dataset`

### Import necessary libraries

In [132]:
from faker import Faker

Here *faker* (with the small f) is the **package** and *Faker* (with the capital F) is a **class** that we import from it.

In [133]:
import pandas as pd  # To create the dataframe
import random        # To randomly pick between available choices

In [134]:
fake = Faker()

The object `fake` is an instance of the `Faker` class. Its methods are used below to generate the fake data.

### Now to create the dataset

In [146]:
# All the available flavors of ice cream are listed
options = ['Vanilla','Butterscotch','Strawberry','Chocolate','Black Current',
           'Peanut Butter','Pistachio','Tutti Frutti','Mango','Chocolate Chip']

# The price of each ice cream flavor is set.
# There is a data dependency here (one piece of data relies on another to make sense):
# amt depends on the flavors in options because it maps each flavor to a price.
amt = {
    'Vanilla':25,
    'Butterscotch':35,
    'Strawberry':25,
    'Chocolate':30,
    'Black Current':35,
    'Peanut Butter':45,
    'Pistachio':35,
    'Tutti Frutti':35,
    'Mango':30,
    'Chocolate Chip':40
}

# Following are the different payment modes available
modes = ['Cash','Card','GPay','PhonePe','PayPal']
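
Because `amt` must hold a price for every entry in `options`, a quick consistency check can catch a missing or misspelled flavor before any rows are generated. The following is a minimal sketch of such a check. The shortened lists and the helper name `missing_prices` are illustrative assumptions and are not part of the notebook.

```python
# Shortened copies of the notebook's data, for illustration only
options = ['Vanilla', 'Butterscotch', 'Mango']
amt = {'Vanilla': 25, 'Butterscotch': 35, 'Mango': 30}

def missing_prices(flavors, prices):
    """Return the flavors that have no entry in the price mapping (hypothetical helper)."""
    return [flavor for flavor in flavors if flavor not in prices]

# An empty list means every flavor can be looked up safely with amt[flavor]
print(missing_prices(options, amt))               # []

# A flavor without a price would raise a KeyError during the price lookup
print(missing_prices(options + ['Coffee'], amt))  # ['Coffee']
```

Running a check like this right after defining the two structures keeps the dependency explicit instead of leaving it to a `KeyError` at lookup time.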

Now we create a function called `create_dataset` that builds the dataset for us. It takes one argument, `n`, which specifies the number of rows (customers) we wish to create.

In [143]:
def create_dataset(n):
    table = []  # Will hold one dictionary per row
    for i in range(n):
        row = {}
        # `i+1` is the integer being formatted with the format spec `05d`:
        #   `d` formats it as a decimal integer
        #   `5` sets the total width to 5 characters
        #   `0` pads with leading zeros
        # The generated order IDs look like ORD-00001, ORD-00002, etc.
        row['order_id'] = f"ORD-{i+1:05d}"
        row['date'] = fake.date_between('-2y', 'today')          # Random date between 2 years ago and today
        row['time'] = fake.time()                                # Random time in the format HH:MM:SS
        row['customer_name'] = fake.name()                       # Random name
        row['customer_phno'] = fake.numerify(text='9#########')  # Random 10-digit phone number starting with 9
        # The chosen flavor is stored in `f` because it is needed again
        # below to look up the matching price
        f = random.choice(options)
        row['flavor'] = f
        row['price'] = amt[f]                                    # Price of the chosen flavor
        row['payment_mode'] = random.choice(modes)               # Random mode of payment

        table.append(row)
    return pd.DataFrame(table)
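
The `05d` format specification used for `order_id` can be tried on its own. This small sketch uses a few arbitrary sample values of `i` (chosen only for illustration) to show how the zero padding behaves:

```python
# The same format specification that create_dataset uses for order_id
for i in (0, 8, 99, 12344):
    print(f"ORD-{i+1:05d}")
# ORD-00001
# ORD-00009
# ORD-00100
# ORD-12345

# The width is a minimum, not a limit: longer numbers are not truncated
wide = f"ORD-{123456:05d}"
print(wide)  # ORD-123456
```

So the IDs keep a fixed width up to 99,999 orders and simply grow wider beyond that.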

**Explanation of the above created function**

- The function `create_dataset` takes one argument, `n`, which specifies the number of rows (customers) we wish to create.
- Before the dataset is converted to a `DataFrame`, it is stored as a list named `table`.
- This list will eventually hold dictionaries, one for each row of data.
- The `for` loop iterates `n` times to create the `n` rows of the dataset.
- Each iteration generates one row of data.
- Each row is created and stored in the form of a dictionary.
- For each row, the values `order_id`, `date`, `time`, `customer_name`, `customer_phno`, `flavor`, `price` and `payment_mode` are generated.
- At the end of each iteration, that row is appended to the `table` list.
- After looping `n` times, the list of dictionaries is converted into a pandas `DataFrame`, where each dictionary becomes a row and each key becomes a column.
- Finally, this DataFrame is returned.

The columns `order_id`, `date`, `time`, `customer_name`, `customer_phno`, `flavor`, `price` and `payment_mode` are created automatically by pandas when we run `pd.DataFrame(table)`.
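
To see this conversion on a small scale, here is a minimal sketch with two hand-written rows. The values and the name `df_small` are made up for illustration and stand in for the generated data.

```python
import pandas as pd

# Two hand-written rows standing in for the generated ones (illustrative values)
table = [
    {'order_id': 'ORD-00001', 'flavor': 'Mango', 'price': 30},
    {'order_id': 'ORD-00002', 'flavor': 'Vanilla', 'price': 25},
]

df_small = pd.DataFrame(table)

# The dictionary keys have become the column names
print(list(df_small.columns))  # ['order_id', 'flavor', 'price']

# One row per dictionary, one column per key
print(df_small.shape)          # (2, 3)
```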

**Question**: Isn't it a waste of memory to create the exact same keys over and over again for each dictionary (row)?

**Answer**:

No, it isn't. Each key, such as `'order_id'`, is a string literal in the function, and Python (CPython) stores that string only once in memory. Each dictionary then holds a reference (a pointer) to that same string object for its key, so the key text itself is not duplicated for every dictionary. The values in each dictionary (like 'John Doe' or 'ORD-00001') are usually unique objects, so each value takes up its own memory.

So no matter how many rows use the same `key`, they all reference (point to) the same string object: memory for the key is allocated once, not once per row. Each dictionary does still keep its own small table of references, so a row is not entirely free.
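
The sharing of key objects can be observed directly. The sketch below assumes CPython, where a string literal inside a function is stored once and reused on every call. The function `make_row` is a cut-down, hypothetical stand-in for `create_dataset`.

```python
import sys

def make_row(i):
    """Build a one-key row the same way create_dataset does (hypothetical helper)."""
    row = {}
    row['order_id'] = f"ORD-{i+1:05d}"  # The same key literal is used on every call
    return row

rows = [make_row(i) for i in range(3)]

keys = [next(iter(row)) for row in rows]     # The key object held by each dictionary
values = [row['order_id'] for row in rows]   # The value object held by each dictionary

# Every dictionary refers to the very same key object (True on CPython)
print(all(key is keys[0] for key in keys))

# Each value is a separate string object
print(len({id(value) for value in values}))  # 3

# Each dictionary still has its own table of references, which takes some memory
print(sys.getsizeof(rows[0]))
```

The `is` operator compares object identity rather than equality, which is why it is used here instead of `==`.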

**Note:** These dictionary keys directly become the DataFrame’s column names, and the values become the row entries.

### The Dataset 

In [167]:
df = create_dataset(100)
df

| | order_id | date | time | customer_name | customer_phno | flavor | price | payment_mode |
|---|---|---|---|---|---|---|---|---|
| 0 | ORD-00001 | 2024-08-31 | 20:58:20 | James Dixon | 9243016748 | Black Current | 35 | PayPal |
| 1 | ORD-00002 | 2025-03-19 | 13:21:48 | Angela Lopez | 9576521909 | Chocolate Chip | 40 | GPay |
| 2 | ORD-00003 | 2025-10-17 | 09:58:58 | Destiny Davis | 9791529488 | Black Current | 35 | GPay |
| 3 | ORD-00004 | 2025-09-03 | 12:35:13 | Brenda Kemp | 9617226992 | Peanut Butter | 45 | PayPal |
| 4 | ORD-00005 | 2025-03-15 | 15:43:20 | Bryan Fields | 9903447983 | Peanut Butter | 45 | PayPal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | ORD-00096 | 2025-07-15 | 09:55:33 | Jacob Cunningham | 9576718977 | Butterscotch | 35 | Card |
| 96 | ORD-00097 | 2024-05-29 | 11:46:24 | Allison Hernandez | 9494836067 | Peanut Butter | 45 | PayPal |
| 97 | ORD-00098 | 2024-05-17 | 08:14:26 | Barbara Newman | 9417566267 | Pistachio | 35 | PayPal |
| 98 | ORD-00099 | 2024-07-20 | 20:33:16 | Amber Morrison | 9895905210 | Peanut Butter | 45 | Card |


### The dataset is exported as a `.csv` file

In [165]:
df.to_csv('Ice_Cream_Dataset.csv')
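
By default, `to_csv` also writes the DataFrame's row index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` leaves it out. The sketch below shows the difference on a tiny, made-up DataFrame, using in-memory buffers instead of real files so that nothing is written to disk. The names `df_small`, `cols_with_index` and `cols_without_index` are illustrative.

```python
import io
import pandas as pd

# A tiny stand-in for the generated dataset (illustrative values)
df_small = pd.DataFrame({'order_id': ['ORD-00001', 'ORD-00002'], 'price': [30, 25]})

# Default behavior: the row index is written as an unnamed first column
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)     # ['Unnamed: 0', 'order_id', 'price']

# index=False: only the actual data columns are written
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)  # ['order_id', 'price']
```

Whether to keep the index is a design choice. Here `order_id` already identifies each row, so `df.to_csv('Ice_Cream_Dataset.csv', index=False)` would be a reasonable alternative to the call above.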