Objective Statement

1. The objective of this project is to generate and manipulate synthetic data representing the operations of Ghana Rubber Estates Limited (GREL) using the Faker library. This involves creating 100,000 records, each with five fields: 'Production Date', 'Production Quantity (kg)', 'Production Location', 'Production Cost', and 'Production Hours'.

2. The data will then be ingested into a relational database such as PostgreSQL or MySQL through a structured pipeline. After ingestion, the data will be queried to answer at least 10 questions about GREL's production activities. The queries will be documented in a separate file, and the pipeline will be illustrated with a diagram.

This initiative aims to simulate real-world data management scenarios and provide insights into the operations of GREL for analytical purposes.
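
As a rough illustration of the ingest-then-query step in item 2, the sketch below loads a few rows into a database and answers one sample question. It assumes SQLite (Python's built-in `sqlite3`) as a stand-in for PostgreSQL or MySQL. The table name `production` and the snake_case column names are hypothetical, and the three rows are copied from the sample output shown later in this notebook.

```python
import sqlite3

# Three sample rows copied from the notebook's output further down
rows = [
    ("2024-02-07", 1099, "Kade, Eastern Region", 53479, 4),
    ("2023-12-01", 9498, "Daboase, Western Region", 58971, 66),
    ("2023-12-06", 5398, "Daboase, Western Region", 16828, 27),
]

# In-memory SQLite stands in for PostgreSQL/MySQL (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE production (
        production_date TEXT,
        production_quantity_kg INTEGER,
        production_location TEXT,
        production_cost INTEGER,
        production_hours INTEGER
    )
    """
)
conn.executemany("INSERT INTO production VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Sample question: total quantity produced at each location
totals = conn.execute(
    """
    SELECT production_location, SUM(production_quantity_kg)
    FROM production
    GROUP BY production_location
    ORDER BY production_location
    """
).fetchall()
print(totals)
conn.close()
```

Against a real PostgreSQL or MySQL server, the same pattern would apply with a different driver and connection string; the SQL itself is standard enough to carry over largely unchanged.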

In [1]:
# Install the Faker library

%pip install faker

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Import libraries

from faker import Faker
import pandas as pd
import numpy as np

In [10]:
# Initializing Faker to generate fake data

fake = Faker()

In [11]:
# Defining cities within each region of production

# The main production sites are typically located in the Eastern, Western, and Central regions of Ghana. 

cities_by_region = {
    'Eastern Region': ['Bunso', 'Kade'],
    'Western Region': ['Daboase', 'Agona'],
    'Central Region': ['Abura Dunkwa']
}


In [13]:
# Defining a function to generate production data for Ghana Rubber Estates Limited

def generate_production_data(num_records):
    
    data = []
    
    for _ in range(num_records):
        
        # Generate production date
        production_date = fake.date_between(start_date='-1y', end_date='today')  # Production date within the last year
        
        # Generate production quantity (in kilograms)
        production_quantity = fake.random_number(digits=4)  # Random integer with up to 4 digits (0-9999)
        
        # Generate production location (plantation) within the specified cities and regions
        region = fake.random_element(elements=list(cities_by_region.keys()))  # Randomly select a region
        city = fake.random_element(elements=cities_by_region[region])  # Randomly select a city within the region
        production_location = f"{city}, {region}"  # Format production location
        
        
        # Generate production cost (in local currency)
        production_cost = fake.random_number(digits=5)  # Random integer with up to 5 digits (0-99999)
        
        # Generate production hours
        production_hours = fake.random_number(digits=2)  # Random integer with up to 2 digits (0-99)
        
        # Append the generated data to the list
        data.append({
            'Production Date': production_date,
            'Production Quantity (kg)': production_quantity,
            'Production Location': production_location,
            'Production Cost': production_cost,
            'Production Hours': production_hours
        })
    
    return data
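
One consequence of picking a region first and then a city within it is that locations are not equally likely: a region with a single city sends its whole share to that city. The sketch below works out the exact selection probability for each location, assuming `fake.random_element` picks uniformly from a plain list. The helper name `location_probabilities` is hypothetical and not part of the original notebook.

```python
from fractions import Fraction

cities_by_region = {
    'Eastern Region': ['Bunso', 'Kade'],
    'Western Region': ['Daboase', 'Agona'],
    'Central Region': ['Abura Dunkwa'],
}

def location_probabilities(mapping):
    """Exact probability of each 'city, region' under two-stage uniform sampling."""
    p_region = Fraction(1, len(mapping))  # each region is equally likely
    probs = {}
    for region, cities in mapping.items():
        for city in cities:
            # region probability times the chance of this city within the region
            probs[f"{city}, {region}"] = p_region * Fraction(1, len(cities))
    return probs

probs = location_probabilities(cities_by_region)
for location, p in probs.items():
    print(f"{location}: {p}")
```

Under this assumption, about a third of the 100,000 records would fall on Abura Dunkwa and about a sixth on each of the other four sites. If equal weighting per site is preferred, one option is to sample from a flat list of city and region pairs instead.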



In [14]:
# Define the number of records to generate
num_records = 100000



In [15]:
# Generate production data for Ghana Rubber Estates Limited
production_data = generate_production_data(num_records)



In [16]:
# Create a DataFrame from the generated data
df = pd.DataFrame(production_data)



In [17]:
# Save the DataFrame to a CSV file
df.to_csv('ghana_rubber_production_data.csv', index=False)

print("Production data generation completed. File 'ghana_rubber_production_data.csv' saved.")

Production data generation completed. File 'ghana_rubber_production_data.csv' saved.


In [18]:
# Reload the saved CSV to confirm it was written correctly
df = pd.read_csv("ghana_rubber_production_data.csv")

In [19]:
df

,Production Date,Production Quantity (kg),Production Location,Production Cost,Production Hours
0,2024-02-07,1099,"Kade, Eastern Region",53479,4
1,2024-02-19,2640,"Abura Dunkwa, Central Region",32618,93
2,2023-12-01,9498,"Daboase, Western Region",58971,66
3,2023-12-06,5398,"Daboase, Western Region",16828,27
4,2023-09-24,461,"Daboase, Western Region",58599,77
...,...,...,...,...,...
99995,2023-06-25,4246,"Agona, Western Region",24418,70
99996,2023-04-15,3195,"Daboase, Western Region",25297,91
99997,2023-07-16,2192,"Abura Dunkwa, Central Region",436,71
99998,2023-05-09,501,"Abura Dunkwa, Central Region",96369,98


In [12]:
df.columns

Index(['Production Date', 'Production Quantity (kg)', 'Production Location',
       'Production Cost', 'Production Hours'],
      dtype='object')
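
Column names containing spaces and parentheses need quoting in SQL, so before ingestion it may help to map them to snake_case and to split 'Production Location' into separate city and region values. The following is a minimal sketch using only the standard library. The helper names `to_snake_case` and `split_location` are hypothetical and not part of the original notebook.

```python
import re

columns = ['Production Date', 'Production Quantity (kg)', 'Production Location',
           'Production Cost', 'Production Hours']

def to_snake_case(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r'[^0-9a-z]+', '_', name.lower()).strip('_')

def split_location(location):
    """Split 'City, Region' into a (city, region) tuple."""
    city, region = location.split(', ', 1)
    return city, region

sql_columns = [to_snake_case(c) for c in columns]
print(sql_columns)
print(split_location('Abura Dunkwa, Central Region'))
```

Storing city and region in separate columns would make per-region questions a simple `GROUP BY` rather than a string match on the combined field.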