<a href="https://colab.research.google.com/github/Collinsnwoye/Data-Ingestion/blob/main/Synthetictest1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Synthetic Data Transformation
-------
This example demonstrates a basic ETL (Extract, Transform, Load) pipeline using the Faker library in Python. It generates realistic-looking records that contain no real person's identifying details, applies a simple transformation to each record's phone number, and loads the results into a Pandas DataFrame for display.

## Below is a high-level overview of the key functions used in this ETL pipeline:

* pip install Faker
  Faker is not part of the Python Standard Library, so it must be installed first.

* generate_fake_people()
  Creates a list of synthetic records, each with a fake name, phone number, email address, job title, and city. This simulates incoming raw data.

* standardize_phone_numbers(data)
  Transforms each record's Phone_number to the UK format by stripping non-digit characters and known prefixes, then prepending the country code (+44).

* (Processing in batches is not needed because the dataset is relatively small.)

* load_data(fake_people)
  Simulates loading all the transformed records into a database (in this case, a Pandas DataFrame that is also saved to CSV).

* main()
  Orchestrates the ETL flow: generates the data, transforms it, and loads it.

------

# Installing Faker Library


In [1]:
pip install faker

Collecting faker
  Downloading faker-37.4.0-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.4.0-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-37.4.0


#### This installs the Faker library into a temporary runtime environment. It is available immediately, but it does not persist after the session ends, so re-run the install in each new session.

-------
# Generate Synthetic Data Function

In [20]:
from faker import Faker

fake = Faker()
def generate_fake_people():
    fake_people = []


    for i in range(3):
        person = {
            "Full_name": fake.name(),
            "Phone_number": fake.phone_number(),
            "Email_address": fake.email(),
            "Job_title": fake.job(),
            "City": fake.city()
        }
        fake_people.append(person)
        print(fake_people)  # debug output: show the list as it grows


    return fake_people


people_data = generate_fake_people()

print(people_data)

[{'Full_name': 'Mark Harris', 'Phone_number': '001-674-274-4138x026', 'Email_address': 'mariosherman@example.com', 'Job_title': 'Cartographer', 'City': 'Port Robertchester'}]
[{'Full_name': 'Mark Harris', 'Phone_number': '001-674-274-4138x026', 'Email_address': 'mariosherman@example.com', 'Job_title': 'Cartographer', 'City': 'Port Robertchester'}, {'Full_name': 'Ethan Giles', 'Phone_number': '001-461-524-1658x01438', 'Email_address': 'alexandraarellano@example.net', 'Job_title': 'Rural practice surveyor', 'City': 'New Ronaldtown'}]
[{'Full_name': 'Mark Harris', 'Phone_number': '001-674-274-4138x026', 'Email_address': 'mariosherman@example.com', 'Job_title': 'Cartographer', 'City': 'Port Robertchester'}, {'Full_name': 'Ethan Giles', 'Phone_number': '001-461-524-1658x01438', 'Email_address': 'alexandraarellano@example.net', 'Job_title': 'Rural practice surveyor', 'City': 'New Ronaldtown'}, {'Full_name': 'Tara Jackson', 'Phone_number': '8739378368', 'Email_address': 'larry84@example.org',

In [18]:
X = []
type(X)

list

In [None]:
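def generate_fake_people(num_records=100):
    """Return a list of `num_records` synthetic person records."""
    fake_people = []

    for _ in range(num_records):
        # One dictionary per record; the keys act as column names.
        person = {
            "Full_name": fake.name(),
            "Phone_number": fake.phone_number(),
            "Email_address": fake.email(),
            "Job_title": fake.job(),
            "City": fake.city()
        }
        fake_people.append(person)

    return fake_people


people_data = generate_fake_people()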

#### Design Considerations for Synthetic Data Generator Function:

When building a function to generate sample data, it is important to consider structure, flexibility, and practical processing needs. Here is a breakdown of the design choices we made:

1. Data Structure Choice: Each data record is represented as a Python dictionary of key-value pairs. This mirrors tabular data with named columns, making records self-describing and easy to manipulate or convert.

2. Using a List as a Container: All generated records are stored in a list, which maintains order and supports iteration and slicing, both of which are crucial for downstream tasks like transformation or loading.

3. Record Generation Loop: A simple for loop runs once per requested record, appending each newly created dictionary to the list and progressively building the dataset.

4. Function Purpose: This function acts as a lightweight simulated data source, useful for testing data pipelines and experimenting with transformation logic without relying on external databases.
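
The list-of-dictionaries layout described above can be sketched without Faker. This is only a minimal illustration: it assumes the standard library's `random` module as a stand-in value source, and `make_records`, `NAMES`, and `CITIES` are hypothetical names that are not part of the notebook.

```python
import random

# Hypothetical value pools standing in for Faker's generated values.
NAMES = ["Ada Obi", "Liam Shaw", "Mei Tan"]
CITIES = ["Leeds", "Lagos", "Osaka"]

def make_records(num_records, seed=0):
    """Build a list of dictionaries, one dictionary per record."""
    rng = random.Random(seed)  # fixed seed so repeated runs match
    records = []
    for _ in range(num_records):
        records.append({
            "Full_name": rng.choice(NAMES),
            "City": rng.choice(CITIES),
        })
    return records

records = make_records(5)

# Every record carries its own column names.
print(list(records[0].keys()))

# The list keeps insertion order, so slicing returns the first records.
first_two = records[:2]
print(len(first_two))

# Iterating over the list visits one record (row) at a time.
cities = [record["City"] for record in records]
print(len(cities))
```

Because each record is self-describing, the same list can be handed directly to a transformation step or to `pd.DataFrame(...)` without extra column metadata.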

------

# Transform Data Function

In [None]:
import re

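def standardize_phone_numbers(data):
    """Rewrite each record's Phone_number as '+44' plus 10 digits (in place)."""
    for person in data:
        raw_phone = person["Phone_number"]

        # Keep digits only: drops '+', '-', spaces, and the 'x' of extensions.
        digits_only = re.sub(r'\D', '', raw_phone)

        # Strip one known prefix. '00' is checked before '0'; otherwise
        # the '00' branch could never be reached.
        if digits_only.startswith('00'):
            digits_only = digits_only[2:]
        elif digits_only.startswith('44'):
            digits_only = digits_only[2:]
        elif digits_only.startswith('234'):
            digits_only = digits_only[3:]
        elif digits_only.startswith('0'):
            digits_only = digits_only[1:]

        # Trim to 10 digits, or left-pad with zeros if shorter.
        digits_only = digits_only[:10].zfill(10)
        person["Phone_number"] = '+44' + digits_only

    return data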
people_data = standardize_phone_numbers(people_data)

for person in people_data[:20]:
    print(person["Phone_number"])

#### Design Considerations for transform data Function:

1. For Loop Iteration: The function processes each record in order and removes all non-digit characters with a regular expression.

2. Prefix Removal: It strips one known prefix, such as a country code or a leading zero, to isolate the core number.

3. Length and Country Code: Finally, it trims or pads the number to ensure it is exactly 10 digits long, and prepends the UK country code +44 for standardization.
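
To make these steps easier to follow, here is a small trace of the cleanup on a few inputs. It is a sketch, not the notebook's function: `trace_phone_cleanup` is a hypothetical helper, it handles only a subset of the prefixes, and the sample numbers are made up for illustration.

```python
import re

def trace_phone_cleanup(raw_phone):
    """Return the intermediate value produced by each cleanup step."""
    # Step 1: keep digits only.
    digits = re.sub(r"\D", "", raw_phone)

    # Step 2: strip one known prefix, if present.
    core = digits
    for prefix in ("44", "234", "0"):
        if core.startswith(prefix):
            core = core[len(prefix):]
            break

    # Step 3: trim to 10 digits, left-padding with zeros if shorter.
    fixed = core[:10].zfill(10)

    # Step 4: prepend the UK country code.
    return digits, core, fixed, "+44" + fixed

for sample in ["+44 20 7946 0958", "020 7946 0958", "12345"]:
    print(sample, "->", trace_phone_cleanup(sample))
```

The first two samples end up as the same value, `+442079460958`, which is the point of standardizing. The third shows the padding case: `12345` becomes `+440000012345`, a well-formed string that is not a meaningful phone number, so padding is best treated as a formatting guarantee rather than validation.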

---------

# Load Data Function


In [None]:
import pandas as pd

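def load_data(fake_people, filename="fake_people_data.csv", index=False, encoding="utf-8"):
    # Convert the list of dictionaries into a tabular DataFrame.
    df = pd.DataFrame(fake_people)

    # Use the arguments passed in rather than hard-coded values.
    df.to_csv(filename, index=index, encoding=encoding)

    return df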
df_people = load_data(people_data)

print(df_people.head(20))

### Design Considerations for load data Function:

1. Structured Output with Pandas DataFrame: The function converts the list of transformed records into a pandas.DataFrame, creating a structured, tabular representation of the data. This format is widely used in data workflows for its flexibility and powerful analysis capabilities.

2. Data Persistence via CSV Export: The function saves the DataFrame to a CSV file using df.to_csv(). This makes the data portable and easy to inspect, share, or feed into other tools and systems. Using index=False avoids writing the DataFrame's row index as a column in the CSV, keeping the file clean.

3. Function Return Value: After saving, the function returns the DataFrame, allowing further use in our process (e.g., visualization, validation, or further transformation).
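
With `index=False`, the exported file contains only a header row of column names followed by one line per record. The sketch below reproduces that layout in memory, using the standard library's `csv` module as a stand-in for `df.to_csv()` so that it runs without pandas; the two sample records are made up.

```python
import csv
import io

# Made-up records in the same list-of-dictionaries shape the pipeline uses.
records = [
    {"Full_name": "Ada Obi", "Phone_number": "+442079460958"},
    {"Full_name": "Liam Shaw", "Phone_number": "+440000012345"},
]

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
writer = csv.DictWriter(
    buffer,
    fieldnames=list(records[0].keys()),
    lineterminator="\n",
)
writer.writeheader()       # column names only, no index column
writer.writerows(records)  # one line per record

csv_text = buffer.getvalue()
print(csv_text)
```

Each data line starts with the first real column (`Full_name`). With the row index included, every line would instead begin with an extra numeric field.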

-------

# The Main Function



In [None]:
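def main():
    # Extract: generate the synthetic records.
    data = generate_fake_people()
    # Transform: standardize every phone number to the +44 format.
    processed_data = standardize_phone_numbers(data)
    # Load: build the DataFrame from the transformed records and save it.
    df = load_data(processed_data, "fake_people_data.csv", index=False, encoding="utf-8")

    print("Data processing complete. The processed data is saved in fake_people_data.csv.")
    return df

if __name__ == "__main__":
    main()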


### Design Considerations for main Function:

1. Pipeline Coordination: The main() function serves as the central controller for the ETL (Extract → Transform → Load) process. It organizes the execution flow, making the script modular, readable, and maintainable.

2. Final Loading Step: Once all data is transformed, processed_data is passed to load_data() for conversion to a structured format (DataFrame) and persistence to CSV. Returning the DataFrame allows flexibility for downstream tasks like visualization or further processing.

### Final Consideration: By delegating each step to its own function (generate_fake_people, standardize_phone_numbers, load_data), main() promotes a clean and modular design. This makes the codebase easier to test, debug, and scale.
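
One practical benefit of this separation is that the output of a single step can be checked on its own. As a sketch, the helper below flags records whose phone number is not `+44` followed by exactly 10 digits. `find_invalid_phone_numbers` and `UK_PATTERN` are hypothetical names, and the sample records are made up.

```python
import re

# '+44' followed by exactly 10 digits, the format the transform step targets.
UK_PATTERN = re.compile(r"\+44\d{10}")

def find_invalid_phone_numbers(records):
    """Return the records whose Phone_number is not '+44' plus 10 digits."""
    return [
        record for record in records
        if not UK_PATTERN.fullmatch(record["Phone_number"])
    ]

sample = [
    {"Phone_number": "+442079460958"},  # already standardized
    {"Phone_number": "020 7946 0958"},  # not yet standardized
]

invalid = find_invalid_phone_numbers(sample)
print(len(invalid))
```

Run on the transformed records, a check like this should return an empty list. Anything else points at the transform step rather than at generation or loading.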
