# **Mimesis Data Generator**

Mimesis is a robust data generator for Python that can produce a wide range of synthetic data in various languages. This tool is useful for populating testing databases, creating fake API endpoints, filling pandas DataFrames, generating JSON and XML files with custom structures, and anonymizing production data, among other purposes.

The purpose of this notebook is to attempt to create a synthetic anti-money-laundering (AML) dataset for a financial institution in Australia.

The first step is to install the mimesis library.

In [None]:
!pip install mimesis

Collecting mimesis
  Downloading mimesis-15.1.0-py3-none-any.whl (4.5 MB)
Installing collected packages: mimesis
Successfully installed mimesis-15.1.0


The code below lists the providers available on the `Generic` object, here named `fake`.

In [None]:
from mimesis import Generic
from mimesis.locales import Locale

fake = Generic(Locale.EN_AU)
print(dir(fake))

['address', 'binaryfile', 'choice', 'code', 'cryptographic', 'datetime', 'development', 'file', 'finance', 'food', 'hardware', 'internet', 'numeric', 'path', 'payment', 'person', 'science', 'text', 'transport']


The code below lists the attributes and methods of the address provider.

In [None]:
address_methods = dir(fake.address)
print(address_methods)

['Meta', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_dataset', '_dd_to_dms', '_extract', '_get_fs', '_has_seed', '_load_dataset', '_override_locale', '_read_global_file', '_setup_locale', '_update_dict', 'address', 'calling_code', 'city', 'continent', 'coordinates', 'country', 'country_code', 'country_emoji_flag', 'default_country', 'federal_subject', 'get_current_locale', 'isd_code', 'latitude', 'locale', 'longitude', 'override_locale', 'postal_code', 'prefecture', 'province', 'random', 'region', 'reseed', 'seed', 'state', 'street_name', 'street_number', 'street_suffix', 'update_dataset', 'validate_enum', 'zip_code']
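The raw `dir()` output mixes dunder and private names with the methods that actually generate data. A minimal sketch of filtering it down is shown here. `public_names` is a hypothetical helper, and `DemoProvider` is only a stand-in class so that the snippet runs without Mimesis installed.

```python
def public_names(obj):
    """Return the names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class DemoProvider:
    """Stand-in for a provider such as fake.address (illustration only)."""

    def city(self):
        return "Perth"

    def _load_dataset(self):
        return {}


names = public_names(DemoProvider())
print(names)
```

In the notebook, calling `public_names(fake.address)` should drop entries such as `__class__` and `_dataset` and keep generator names such as `city` and `postal_code`.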


A DataFrame is built by calling methods from several providers inside a list comprehension that generates 1,000 rows. Each transaction amount is clamped to a minimum of 0.01 so that it is positive, and the loop below the DataFrame redraws any duplicates so that every transaction amount is unique.

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
   [
       {
           "customer_id": np.random.randint(1, 1000000),
           "name": fake.person.full_name(),
           "email": fake.person.email(),
           "occupation": fake.person.occupation(),
           "bank_1": fake.finance.bank(),
           "bank_2": fake.finance.bank(),  # May be the same as bank_1.
           "transaction_date": fake.datetime.date(),
           "transaction_amount": round(max(0.01, round(fake.numeric.decimal_number(), 2)), 2),
           "transaction_type": np.random.choice(['PayID', 'BSB', 'Internal']),
           "payment_type": np.random.choice(['Credit', 'Debit']),
       }
       for _ in range(1000)
   ]
)

# Ensure unique transaction amounts
generated_amounts = set()
for i, amount in enumerate(df['transaction_amount']):
    while amount in generated_amounts:
        amount = round(max(0.01, round(fake.numeric.decimal_number(), 2)), 2)
    df.at[i, 'transaction_amount'] = amount
    generated_amounts.add(amount)

df.head()


| | customer_id | name | email | occupation | bank_1 | bank_2 | transaction_date | transaction_amount | transaction_type | payment_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44077 | Nohemi Floyd | today1886@yandex.com | Steel Erector | ING Bank Limited | Northern Beaches Credit Union Ltd | 2008-07-30 | 465.81 | BSB | Credit |
| 1 | 667000 | Dani Walter | missouri1833@yandex.com | Systems Engineer | Police Bank Ltd | Community First Credit Union Limited | 2013-11-02 | 381.47 | BSB | Debit |
| 2 | 816283 | Shaun Grimes | award1994@duck.com | Stonemason | WAW Credit Union Co-operative Limited | Maitland Mutual Building Society Ltd | 2010-04-30 | 419.67 | PayID | Credit |
| 3 | 712287 | Billy Forbes | massage1831@live.com | Furniture Restorer | P&N Bank | Auswide Bank Ltd | 2010-09-22 | 351.19 | BSB | Credit |
| 4 | 592906 | Angle Marquez | promoted1860@gmail.com | Barber | Bank of Queensland Limited | G&C Mutual Bank | 2013-02-27 | 517.62 | PayID | Credit |
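Clamping and then redrawing duplicates works, but positive and unique amounts can also be obtained in one step. The sketch below is an alternative, not the method used above: it samples whole cents without replacement using the standard library's `random` module instead of Mimesis. The function name `unique_positive_amounts` and the bounds of 0.01 to 1000.00 are assumptions chosen for illustration.

```python
import random
from decimal import Decimal


def unique_positive_amounts(n, low_cents=1, high_cents=100_000, seed=None):
    """Return n distinct positive amounts with two decimal places.

    Amounts are sampled as whole cents without replacement, so every
    value is unique and at least 0.01 by construction.
    """
    rng = random.Random(seed)
    cents = rng.sample(range(low_cents, high_cents + 1), n)
    return [Decimal(c) / 100 for c in cents]


amounts = unique_positive_amounts(1000, seed=42)
print(amounts[:5])
```

Assuming a DataFrame like the one above, the result could be assigned with `df["transaction_amount"] = unique_positive_amounts(len(df))`, which would make the uniqueness loop unnecessary.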


### **Discussion**

The strength of Mimesis lies in its ability to generate data in many languages and to offer data providers specific to the selected locale. In this notebook, the Australian bank names produced with the `EN_AU` locale are one example.

Another strength is its comparatively faster run time and its ability to produce more unique data than Faker (https://mimesis.name/en/master/about.html).

However, as with Faker, Mimesis has limitations, such as the limited scope for customising built-in methods through arguments. Custom data fields require new functions to be written, which can be time-consuming.
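As an illustration of the kind of custom field function this refers to, the sketch below builds a BSB-style code in the six-digit `XXX-XXX` layout. `random_bsb` is a hypothetical helper written with the standard library, not part of Mimesis, and the digits are arbitrary rather than real institution codes.

```python
import random


def random_bsb(rng=random):
    """Return a random six-digit string formatted as XXX-XXX.

    The digits are arbitrary and are not checked against real bank codes.
    """
    digits = f"{rng.randrange(1_000_000):06d}"
    return f"{digits[:3]}-{digits[3:]}"


rng = random.Random(0)  # Seeded so that repeated runs give the same codes.
sample_bsbs = [random_bsb(rng) for _ in range(3)]
print(sample_bsbs)
```

A function like this could be called inside the list comprehension that builds the DataFrame, in the same way as the Mimesis provider methods.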

Another issue is whether the generated data resembles the statistical properties of a comparable real dataset. For example, randomising a feature such as transaction type with `np.random.choice` draws each value with equal probability, but real data may be skewed or follow other distributions, so it is important to consider this when generating data.
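One way to move away from equal probabilities is to sample categories with explicit weights and to draw amounts from a skewed distribution. The sketch below uses the standard library's `random.choices` and `random.lognormvariate`. The weights and the log-normal parameters are assumptions chosen for illustration, not values fitted to real transaction data.

```python
import random
from collections import Counter

rng = random.Random(7)  # Seeded for reproducibility.

# Hypothetical weights: assume most transfers are PayID and few are internal.
transaction_types = ["PayID", "BSB", "Internal"]
weights = [0.6, 0.3, 0.1]
types = rng.choices(transaction_types, weights=weights, k=1000)

# Log-normal amounts: many small transactions and a long tail of large ones.
# mu=4.0 and sigma=1.0 are illustrative, not fitted to real data.
amounts = [round(rng.lognormvariate(4.0, 1.0), 2) for _ in range(1000)]

print(Counter(types))
print(min(amounts), max(amounts))
```

With NumPy, the same idea can be expressed by passing a `p` argument to `np.random.choice`.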

