A Python application for masking sensitive data in CSV and Excel files using customizable fake data.
- Description
- Features
- Installation
- Usage
- Examples
- Documentation
- Jupyter Notebook Example
- Executable
- Contributing
- Author
- License
The Data Masking Tool is designed to help users anonymize sensitive data within CSV and Excel files. It replaces selected columns with fake data generated using the Faker library. The tool provides a graphical user interface (GUI) built with Tkinter, allowing users to easily select columns to mask, configure fake data types, and customize settings.
- Supports CSV and Excel (.xlsx, .xls) files.
- Customizable masking options for each column.
- Multiple fake data types, including names, emails, phone numbers, dates, and more.
- Configurable settings for fake data generation (e.g., prefixes, suffixes, ranges).
- Option to keep mappings consistent across the dataset.
- Ability to introduce blank values at a specified percentage.
- Simple and intuitive GUI.
- Generates a masked data file without altering the original file.
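The consistent-mapping option can be pictured as a lookup table that reuses the first fake value generated for each original value. The sketch below is an illustration only, not the tool's actual code: `make_consistent_masker` and `fake_name` are hypothetical names, and a small standard-library generator stands in for Faker.

```python
import random
from typing import Callable, Dict


def make_consistent_masker(generate: Callable[[], str]) -> Callable[[str], str]:
    """Return a function that maps each original value to one fixed fake value."""
    mapping: Dict[str, str] = {}

    def mask(original: str) -> str:
        # Generate a replacement only the first time a value is seen.
        if original not in mapping:
            mapping[original] = generate()
        return mapping[original]

    return mask


rng = random.Random(0)  # seeded so repeated runs give the same fake values


def fake_name() -> str:
    # Hypothetical stand-in for a Faker call such as Faker().name().
    return "Person-" + str(rng.randint(1000, 9999))


mask_name = make_consistent_masker(fake_name)
masked = [mask_name(value) for value in ["Alice", "Bob", "Alice"]]
print(masked)
```

Both occurrences of `Alice` receive the same replacement, so relationships between rows survive the masking.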
- Python 3.6 or higher.
- Required Python libraries:
  - pandas
  - numpy
  - faker
  - xlsxwriter
  - xlrd
  - tkinter (comes pre-installed with Python on most systems)
- Clone the Repository
  git clone https://github.com/Marcelo-Has/data-masking-tool.git
  cd data-masking-tool
- Create a Virtual Environment (Recommended)
  python -m venv venv
- Activate the Virtual Environment
  - On Windows:
    venv\Scripts\activate
  - On macOS/Linux:
    source venv/bin/activate
- Install Dependencies
  pip install -r app/docs/requirements.txt
- Run the Application
  python main.py
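If the application fails to start, it can help to confirm that the libraries listed above are importable in the active environment. The following is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper and is not part of the tool.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries listed in the requirements.
REQUIRED = ["pandas", "numpy", "faker", "xlsxwriter", "xlrd", "tkinter"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```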
- Launch the Application
  Run the main script:
  python main.py
- Select CSV Delimiter (If Applicable)
  If you're working with CSV files, choose the appropriate delimiter from the dropdown menu.
- Upload a File
  Click on the "Upload and Mask Data" button and select the CSV or Excel file you wish to mask.
- Select Columns to Mask
  - Check the boxes next to the columns you want to mask.
  - For each selected column:
    - Choose the Field Type for fake data generation.
    - Optionally, set the Blank Percentage to introduce null values.
    - Click on the ⚙️ button to configure additional settings.
- Configure Fake Data (Optional)
  Customize settings such as prefixes, suffixes, ranges, and custom lists in the configuration pane.
- Generate Fake Data
  Click the "Generate Fake Data" button to start the masking process.
- Save the Masked Data
  After processing, you'll be prompted to choose a location to save the masked file. The default filename will be the original name appended with _masked.xlsx.
- View Logs
  Monitor the progress and view detailed logs in the log display area within the application.
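The default output name described in the save step can be derived from the input path. This is a sketch of one way to do it with `pathlib`. `default_masked_path` is a hypothetical helper, and placing the result next to the source file is an assumption, since the save dialog lets you pick any location.

```python
from pathlib import Path


def default_masked_path(source: str) -> Path:
    """Build '<original name>_masked.xlsx' in the same folder as the source file."""
    src = Path(source)
    # Drop the original extension and append the _masked.xlsx suffix.
    return src.with_name(src.stem + "_masked.xlsx")


for name in ["data.csv", "data.xlsx", "reports/q1.xls"]:
    print(name, "->", default_masked_path(name))
```

The output always uses the .xlsx extension regardless of the input format, matching the default filename described above.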
- Launch the application.
- Select the comma (,) delimiter, or the appropriate one.
- Upload your data.csv, data.xlsx, or data.xls file.
- Select columns like Name, Email, and Phone to mask.
- Configure each field type as desired.
- Generate the fake data and save the output.
Note: Your computer may refuse to save to certain folders (for example, the Downloads folder). In that case, try saving to another folder, such as the Desktop.
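The same flow can be pictured in code. The sketch below is not the tool's implementation: it uses the standard-library `csv` module in place of pandas, and the entries in `GENERATORS` are hypothetical stand-ins for Faker providers. It reads comma-delimited text, replaces the Name, Email, and Phone columns, and leaves every other column untouched.

```python
import csv
import io
import random

rng = random.Random(42)  # seeded for repeatable output

# Hypothetical stand-ins for Faker providers, keyed by column name.
GENERATORS = {
    "Name": lambda: "Name-" + str(rng.randint(100, 999)),
    "Email": lambda: "user" + str(rng.randint(100, 999)) + "@example.com",
    "Phone": lambda: "555-" + str(rng.randint(1000, 9999)),
}


def mask_csv(text: str, delimiter: str = ",") -> str:
    """Return a masked copy of CSV text; columns without a generator pass through."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, generate in GENERATORS.items():
            if column in row:
                row[column] = generate()
        writer.writerow(row)
    return out.getvalue()


sample = (
    "Id,Name,Email,Phone\n"
    "1,Ann Lee,ann@corp.test,0123\n"
    "2,Raj Rao,raj@corp.test,0456\n"
)
masked = mask_csv(sample)
print(masked)
```

The input string is never modified, which mirrors the tool's behavior of writing a separate masked file.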
- Use the Custom List field type to mask a column with specific values.
- Set up a Number field type to generate random integers within a range.
- Configure the Date field type to generate random dates between two specified dates.
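These three field types map onto simple random draws. The functions below are a sketch under assumptions: the names `from_custom_list`, `random_int`, and `random_date` are hypothetical, both bounds are treated as inclusive, and the standard library is used instead of Faker.

```python
import random
from datetime import date, timedelta
from typing import List

rng = random.Random(7)  # seeded so the sketch is repeatable


def from_custom_list(options: List[str]) -> str:
    """Custom List: pick one of the user-supplied values."""
    return rng.choice(options)


def random_int(minimum: int, maximum: int) -> int:
    """Number: random integer with both bounds included."""
    return rng.randint(minimum, maximum)


def random_date(start: date, end: date) -> date:
    """Date: random day between start and end, both included."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


tiers = [from_custom_list(["Gold", "Silver", "Bronze"]) for _ in range(5)]
ages = [random_int(18, 65) for _ in range(5)]
joined = [random_date(date(2020, 1, 1), date(2020, 12, 31)) for _ in range(5)]
print(tiers, ages, joined, sep="\n")
```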
- Name
- Full Name
- Address
- Phone
- UUID
- Company
- Department
- City
- Country
- Zip Code
- Product Name
- State or Province
- Row Number
- Custom List
- Number
- Date
- Prefix/Suffix: Add custom text before or after the generated fake data.
- Blank Percentage: Specify the percentage of blank (null) values to introduce.
- Custom Lists: Provide a list of custom values for masking.
- Number Ranges: Set minimum and maximum values for numeric fields.
- Date Ranges: Define start and end dates for date fields.
- UUID Types: Choose between standard UUIDs or custom alphanumeric codes.
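Several of these options can be combined in one generation step. The sketch below is illustrative only: `generate_column`, `standard_uuid`, and `alphanumeric_code` are hypothetical names, and it assumes each row is blanked independently with the given probability, which may differ from how the tool selects blank rows.

```python
import random
import string
import uuid
from typing import Callable, List, Optional

rng = random.Random(1)  # seeded for repeatable output


def standard_uuid() -> str:
    """UUID type 1: a standard random UUID."""
    return str(uuid.uuid4())


def alphanumeric_code(length: int = 8) -> str:
    """UUID type 2: a custom code of uppercase letters and digits."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def generate_column(
    n_rows: int,
    generate: Callable[[], str],
    prefix: str = "",
    suffix: str = "",
    blank_percentage: float = 0.0,
) -> List[Optional[str]]:
    """Build a fake column with optional prefix, suffix, and blank values."""
    values: List[Optional[str]] = []
    for _ in range(n_rows):
        # rng.random() * 100 lies in [0, 100), so 0 never blanks and 100 always does.
        if rng.random() * 100 < blank_percentage:
            values.append(None)
        else:
            values.append(prefix + generate() + suffix)
    return values


codes = generate_column(10, alphanumeric_code, prefix="ID-", blank_percentage=20)
print(codes)
```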
If you prefer to use the Data Masking Tool within a Jupyter Notebook or want to integrate it into your data processing workflows without the GUI, we've provided a comprehensive notebook example that demonstrates how to use the tool programmatically.
- Notebook File: app/docs/DataMaskingToolExample.ipynb
The Jupyter Notebook example covers:
- Importing Necessary Modules: Instructions on setting up your environment with the required libraries.
- Defining Utility Functions: Essential functions needed for data masking operations.
- Configuring Masking Parameters: How to specify which columns to mask and configure their settings.
- Running the Masking Function: Applying the masking to your dataset.
- Reviewing Results: Viewing the masked data and logs.
- Saving Masked Data: Instructions on saving the masked DataFrame to a file.
- Clone the Repository (If Not Already Done)
  git clone https://github.com/Marcelo-Has/data-masking-tool.git
  cd data-masking-tool
- Navigate to the Docs Directory
  cd app/docs
- Open the Notebook
  You can open the notebook using Jupyter Notebook or JupyterLab:
  jupyter notebook DataMaskingToolExample.ipynb
  or
  jupyter lab DataMaskingToolExample.ipynb
- Install Dependencies (If Needed)
  Make sure you have all the required Python libraries installed:
  pip install pandas numpy faker openpyxl xlsxwriter
- Run the Notebook
  - Execute each cell sequentially to understand how the Data Masking Tool works in a notebook environment.
  - The notebook includes detailed explanations and test cases for various masking configurations.
- Adjust Configurations: Modify the configurations in the notebook to suit your dataset and masking requirements.
- Integrate into Your Workflow: Use the code snippets as a starting point to integrate data masking into your data processing pipelines.
- Programmatic Control: Run data masking operations without the GUI, allowing for automation and integration with other code.
- Flexibility: Customize the masking process extensively through code.
- Documentation: The notebook serves as both a tutorial and a reference guide.
If you prefer to use the application without setting up the development environment, you can use the standalone executable file.
- The executable file can be found in the dist folder.
- Path: dist/DataMaskingTool.exe
- Usage:
  - Navigate to the dist directory.
  - Run the executable:
    - On Windows: Double-click DataMaskingTool.exe or run it via Command Prompt.
Note: The executable includes all necessary dependencies and can be run on any Windows machine without installing Python or additional libraries.
Contributions are welcome! To contribute:
- Fork the Repository
  Click the Fork button on the top right to create a copy of this repository on your GitHub account.
- Clone Your Fork
  Replace <your-username> with your GitHub username:
  git clone https://github.com/<your-username>/data-masking-tool.git
- Create a Feature Branch
  git checkout -b feature/your-feature-name
- Commit Your Changes
  git commit -am 'Add new feature'
- Push to the Branch
  git push origin feature/your-feature-name
- Open a Pull Request
  Submit a pull request to the main repository for review.
- Follow PEP 8 style guidelines.
- Write clear, concise commit messages.
- Include docstrings and comments where necessary.
- Update or add tests for new features.
- Use the GitHub issue tracker to report bugs or request features.
- Provide detailed information and steps to reproduce issues.
Created by Marcelo Has
- Email: marcelo_has@outlook.com
- GitHub: Marcelo-Has
- LinkedIn: https://www.linkedin.com/in/marcelohas/
This project is licensed under the MIT License.
Feel free to reach out with questions, suggestions, or contributions!