The Python module developed for the Orange Python Tool Plugin serves a dual purpose, providing functionalities for both image processing using OpenCV and the generation of synthetic datasets through the Faker library.
The generate_fake_dataset
function is designed to generate a fake dataset based on the provided configuration using the Faker library. This can be useful for testing, development, or generating sample data.
generate_fake_dataset(num_rows, path="fake_dataset.csv", none_values=True, table_head=None)
num_rows
: Number of rows to generate in the dataset.path
: The path where the generated dataset will be saved. Default is "fake_dataset.csv".none_values
: Whether to include None values in the generated dataset. Default isTrue
.table_head
: A dictionary specifying the columns and their configurations.
The table_head
dictionary defines the columns of the dataset, where each key is the column name, and the value is a dictionary specifying the column configuration. Supported configurations include:
type
: Faker provider type (e.g., "name", "email", "passport_gender").dtype
: Data type (e.g., "int").range
: Range of values for integer columns (e.g., "1-100").len
: Length of integer columns.
import pandas as pd
from faker import Faker
# Define the column configurations
table_head = {
"name": {"type": "name"},
"email": {"type": "email"},
"gender": {"type": "passport_gender"},
"age": {"dtype": "int", "range": "10-100"},
"score": {"dtype": "int", "range": "10-100"},
"time_s": {"dtype": "int", "range": "10-100"},
}
# Generate a fake dataset with 500 rows and save it to a CSV file
generate_fake_dataset(500, "sample.csv", True, table_head)
The Faker
library provides a wide range of providers for generating fake data. Some of the supported types include:
Types = [ "aba", "add_provider", "address", "administrative_unit", "am_pm", "android_platform_token", "ascii_company_email", "ascii_email", "ascii_free_email", "ascii_safe_email", "bank_country", "basic_phone_number", "bban", "binary", "boolean", "bothify", "bs", "building_number", "cache_pattern", "catch_phrase", "century", "chrome", "city", "city_prefix", "city_suffix", "color", "color_hsl", "color_hsv", "color_name", "color_rgb", "color_rgb_float", "company", "company_email", "company_suffix", "coordinate", "country", "country_calling_code", "country_code", "credit_card_expire", "credit_card_full", "credit_card_number", "credit_card_provider", "credit_card_security_code", "cryptocurrency", "cryptocurrency_code", "cryptocurrency_name", "csv", "currency", "currency_code", "currency_name", "currency_symbol", "current_country", "current_country_code", "date", "date_between", "date_between_dates", "date_object", "date_of_birth", "date_this_century", "date_this_decade", "date_this_month", "date_this_year", "date_time", "date_time_ad", "date_time_between", "date_time_between_dates", "date_time_this_century", "date_time_this_decade", "date_time_this_month", "date_time_this_year", "day_of_month", "day_of_week", "del_arguments", "dga", "domain_name", "domain_word", "dsv", "ean", "ean13", "ean8", "ein", "email", "emoji", "enum", "factories", "file_extension", "file_name", "file_path", "firefox", "first_name", "first_name_female", "first_name_male", "first_name_nonbinary", "fixed_width", "format", "free_email", "free_email_domain", "future_date", "future_datetime", "generator_attrs", "get_arguments", "get_formatter", "get_providers", "hex_color", "hexify", "hostname", "http_method", "iana_id", "iban", "image", "image_url", "internet_explorer", "invalid_ssn", "ios_platform_token", "ipv4", "ipv4_network_class", "ipv4_private", "ipv4_public", "ipv6", "isbn10", "isbn13", "iso8601", "items", "itin", "job", "json", "json_bytes", "language_code", "language_name", "last_name", "last_name_female", "last_name_male", "last_name_nonbinary", "latitude", "latlng", "lexify", "license_plate", "linux_platform_token", "linux_processor", "local_latlng", "locale", "locales", "localized_ean", "localized_ean13", "localized_ean8", "location_on_land", "longitude", "mac_address", "mac_platform_token", "mac_processor", "md5", "military_apo", "military_dpo", "military_ship", "military_state", "mime_type", "month", "month_name", "msisdn", "name", "name_female", "name_male", "name_nonbinary", "nic_handle", "nic_handles", "null_boolean", "numerify", "opera", "optional", "paragraph", "paragraphs", "parse", "passport_dates", "passport_dob", "passport_full", "passport_gender", "passport_number", "passport_owner", "password", "past_date", "past_datetime", "phone_number", "port_number", "postalcode", "postalcode_in_state", "postalcode_plus4", "postcode", "postcode_in_state", "prefix", "prefix_female", "prefix_male", "prefix_nonbinary", "pricetag", "profile", "provider", "providers", "psv", "pybool", "pydecimal", "pydict", "pyfloat", "pyint", "pyiterable", "pylist", "pyobject", "pyset", "pystr", "pystr_format", "pystruct", "pytimezone", "pytuple", "random", "random_choices", "random_digit", "random_digit_above_two", "random_digit_not_null", "random_digit_not_null_or_empty", "random_digit_or_empty", "random_element", "random_elements", "random_int", "random_letter", "random_letters", "random_lowercase_letter", "random_number", "random_sample", "random_uppercase_letter", "randomize_nb_elements", "rgb_color", "rgb_css_color", "ripe_id", "safari", "safe_color_name", "safe_domain_name", "safe_email", "safe_hex_color", "sbn9", "secondary_address", "seed", "seed_instance", "seed_locale", "sentence", "sentences", "set_arguments", "set_formatter", "sha1", "sha256", "simple_profile", "slug", "ssn", "state", "state_abbr", "street_address", "street_name", "street_suffix", "suffix", "suffix_female", "suffix_male", "suffix_nonbinary", "swift", "swift11", "swift8", "tar", "text", "texts", "time", "time_delta", "time_object", "time_series", "timezone", "tld", "tsv", "unique", "unix_device", "unix_partition", "unix_time", "upc_a", "upc_e", "uri", "uri_extension", "uri_page", "uri_path", "url", "user_agent", "user_name", "uuid4", "vin", "weights", "windows_platform_token", "word", "words", "xml", "year", "zip", "zipcode", "zipcode_in_state", "zipcode_plus4"]
Also you can use supported_types()
function to find the supported types.
For a complete list of supported providers, refer to the Faker
documentation or use the Faker.providers
module to explore available options.
The provided Python script includes functions to install an OCR (Optical Character Recognition) tool from an installer executable and to find the path to the Tesseract OCR executable on the system.
This function installs an OCR tool using an installer executable. The installer file should be specified as ocr.exe
. The function retrieves the current script's path, combines it with the installer filename to create the full installer path, and then attempts to run the installer using subprocess.run()
.
# Example: Install OCR
install_ocr()
This function attempts to find the path to the Tesseract OCR executable. It first checks if Tesseract is installed in a specific path (C:\Program Files\Tesseract-OCR\tesseract.exe
). If not found, it then tries to locate Tesseract in the system's PATH using shutil.which("tesseract")
.
# Example: Find Tesseract Path
tesseract_path = find_tesseract()
if tesseract_path:
print(f"Tesseract found at: {tesseract_path}")
else:
print("Tesseract not found.")
- Ensure that the installer file (
ocr.exe
) is in the same directory as the script or provide the correct path to the installer in theinstall_ocr()
function. - For
find_tesseract()
, if Tesseract is not found in the specified path or in the system's PATH, it will returnNone
.
- Before using
install_ocr()
, verify that the installer file is compatible with your system. - Make sure to have the necessary permissions to install software on the system.
- If you are installing from a source other than PyPI, ensure that it is trusted or has been verified by an individual who knows what they
The provided Python script includes various functions for image processing using OpenCV and image-to-text extraction using Tesseract OCR. Below is the documentation for each function along with example usages.
Read an image from the specified path using OpenCV.
image = read_image('path/to/image.jpg')
Display the image using Matplotlib.
display_image(image, title='Original Image')
Convert the image to grayscale.
gray_image = convert_to_grayscale(image)
Apply Gaussian blur to the image.
blurred_image = apply_blur(image, kernel_size=(9, 9))
Apply Canny edge detection to the image.
edges = edge_detection(gray_image, low_threshold=30, high_threshold=100)
Resize the image to the specified dimensions.
resized_image = resize_image(image, new_size=(500, 500))
Adjust the brightness and contrast of the image.
adjusted_image = adjust_brightness_contrast(image, alpha=2.0, beta=50)
Apply a binary threshold to the image.
thresholded_image = apply_threshold(gray_image, threshold_value=100, max_value=255, threshold_type=cv2.THRESH_BINARY)
Apply dilation to the image.
dilated_image = apply_dilation(thresholded_image, kernel_size=(3, 3))
Change the intensity of a specific color channel.
modified_image = change_image_color(image, channel=2, value=50)
Find contours in the image and draw them.
contour_image = find_and_draw_contours(thresholded_image)
Find the most used color in the image.
most_used_color_value = most_used_color(image)
Get details about the image, including dimensions and pixel values.
details = image_details(image)
Convert the image to ASCII art.
ascii_art = image_to_ascii(image, scale_factor=0.05)
print(ascii_art)
Extract text from an image using Tesseract OCR.
text = extract_text_from_image('path/to/image.png', 'path/to/tesseract.exe')
print(text)
Note: Ensure that Tesseract is installed and provide the correct path to the Tesseract executable in the extract_text_from_image
function you can use install_ocr() for ocr installation.
The following are additional image processing functions that enhance the capabilities of the provided script. These functions can be used in combination with the existing ones to perform a wider range of image manipulations.
Rotate the image by a specified angle.
rotate_image(img, angle)
img
: The input image.angle
: The angle by which to rotate the image.
rotated_image = rotate_image(image, angle=45)
Apply a binary mask to the image.
apply_mask(img, mask)
img
: The input image.mask
: Binary mask with the same dimensions as the image.
# Assuming 'mask' is a binary mask with the same dimensions as the image
masked_image = apply_mask(image, mask)
Invert the colors of the image.
invert_image(img)
img
: The input image.
inverted_image = invert_image(image)
Add random noise to the image.
add_noise(img, intensity=50)
img
: The input image.intensity
: Intensity of the noise.
noisy_image = add_noise(image, intensity=30)
Apply morphological operations (dilation or erosion) to the image.
morphological_operations(img, operation='dilate', kernel_size=(5, 5))
img
: The input image.operation
: Operation to perform ('dilate' or 'erode').kernel_size
: Size of the kernel for the operation.
# Dilate the image
dilated_image = morphological_operations(image, operation='dilate', kernel_size=(3, 3))
# Erode the image
eroded_image = morphological_operations(image, operation='erode', kernel_size=(3, 3))