In any machine learning process, data preprocessing is the first step, in which raw, unclean data are transformed into clean data so that machine learning algorithms can be applied at a later stage. This Python package makes data preprocessing easy, in just two lines of code. All you have to do is provide raw data as a CSV file; the library cleans it and returns a cleaned DataFrame, on which you can then apply feature engineering, feature selection, and modeling.
What does it do?
- Cleans special characters
- Removes duplicates
- Fixes abnormalities in column names
- Imputes the data (categorical and numerical)
Data-cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned DataFrame, handling imputation, duplicate removal, special-character replacement, and more.
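The steps listed above can be sketched in plain pandas. This is only an illustration of the kind of operations described, not the package's actual implementation: the helper name `basic_clean`, the regular expressions, the three characters being stripped, and the choice of median/most-frequent filling are all assumptions made for the example.

```python
import re

import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a rough approximation of the four steps above."""
    out = df.copy()

    # Fix column names: trim, replace runs of non-alphanumerics with "_", lowercase.
    out.columns = [
        re.sub(r"[^0-9a-zA-Z]+", "_", str(col).strip()).strip("_").lower()
        for col in out.columns
    ]

    # Clean special characters in non-numeric columns (here only "?", "%", "$").
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].str.replace(r"[?%$]", "", regex=True)

    # Remove duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # Impute: median for numeric columns, most frequent value for the rest.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


raw = pd.DataFrame(
    {
        " First Name ": ["Ann?", "Ann?", "Bob", None],
        "Score (%)": [90.0, 90.0, None, 70.0],
    }
)
cleaned = basic_clean(raw)
print(cleaned)
```

In this toy input the two identical "Ann" rows collapse into one, the column names become `first_name` and `score`, and the remaining gaps are filled from the other values in each column.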
Step 1: Install the library
pip install data-cleaning
Step 2:
Import the library and specify the path to the CSV file.
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv')
cleaned_df = dp.start_cleaning()
You can also specify the optional parameters listed below:
from datacleaning import DataCleaning
DataCleaning(file_upload='filename.csv', separator=",", row_threshold=None, col_threshold=None,
special_character=None, action=None, ignore_columns=None, imputation_type="RDF")
Parameter | Default Value | Accepted Values | Example |
---|---|---|---|
file_upload | none | Path to a CSV file | filename.csv |
separator | , | Separator used in the CSV file | ; |
row_threshold | none | 0 to 100 | 80 |
col_threshold | none | 0 to 100 | 80 |
special_character | See the default list below | Specify characters that are not listed in default_list (see below) | [ '$' , '?' ] |
action | none | add or remove | add |
ignore_columns | none | List of column names to exclude from the special-character operation | [ 'column1', 'column2' ] |
imputation_type | RDF | RDF, KNN, mean, median, most_frequent, or constant | KNN |
The default special characters included in the package are shown below:
default_list = ["!", '"', "#", "%", "&", "'", "(", ")",
"*", "+", ",", "-", ".", "/", ":", ";", "<",
"=", ">", "?", "@", "[", "\\", "]", "^", "_",
"`", "{", "|", "}", "~", "–", "//", "%*", ":/", ".;", "Ø", "§",'$',"£"]
To remove special characters, for example "?" and "%", pass them in special_character.
Note: Do not forget to set action='remove'.
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv', special_character=['?', '%'], action='remove')
cleaned_df = dp.start_cleaning()
To add a special character that is not in the default_list given above, for example "é", pass it in special_character.
Note: Do not forget to set action='add'.
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv', special_character=['é'], action='add')
cleaned_df = dp.start_cleaning()
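One plausible reading of the two actions is that they edit the working list of special characters before cleaning starts: 'add' appends characters that are missing, and 'remove' drops characters from the list. The sketch below shows that reading in plain Python. `resolve_special_characters` is a hypothetical helper, the default list is shortened for brevity, and the package's actual behavior may differ.

```python
from typing import Iterable, List, Optional

# Shortened stand-in for the package's default_list (assumption for this example).
DEFAULT_LIST = ["!", "#", "%", "?", "$"]


def resolve_special_characters(
    special_character: Optional[Iterable[str]] = None,
    action: Optional[str] = None,
    default_list: Iterable[str] = DEFAULT_LIST,
) -> List[str]:
    """Hypothetical helper: return the working list of special characters."""
    chars = list(default_list)
    if special_character is None:
        return chars
    if action == "add":
        # Append only the characters not already present, keeping order.
        for char in special_character:
            if char not in chars:
                chars.append(char)
    elif action == "remove":
        # Drop the given characters from the working list.
        to_drop = set(special_character)
        chars = [char for char in chars if char not in to_drop]
    else:
        raise ValueError("action must be 'add' or 'remove'")
    return chars


added = resolve_special_characters(["é"], "add")
removed = resolve_special_characters(["?", "%"], "remove")
print(added)
print(removed)
```

Raising an error when special_character is given without a valid action mirrors the notes above, which stress that the action must not be forgotten.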
For example, to exclude the columns "timestamp" and "date" from the special-character operation while adding the special character 'é':
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv', special_character=['é'],
                  action='add', ignore_columns=['timestamp', 'date'])
cleaned_df = dp.start_cleaning()
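To illustrate what skipping columns means in practice, the sketch below strips a set of characters from every text column except the ignored ones. It uses plain pandas; `strip_characters` is a hypothetical helper and not part of the package, and deleting the characters (rather than replacing them with something else) is an assumption.

```python
import re

import pandas as pd


def strip_characters(df, characters, ignore_columns=()):
    """Hypothetical helper: delete `characters` from text columns not in `ignore_columns`."""
    out = df.copy()
    # Longest entries first so multi-character entries such as "//" match before "/".
    pattern = "|".join(
        re.escape(c) for c in sorted(characters, key=len, reverse=True)
    )
    for col in out.columns:
        # Skip ignored columns and numeric columns, which have no text to clean.
        if col in ignore_columns or pd.api.types.is_numeric_dtype(out[col]):
            continue
        out[col] = out[col].str.replace(pattern, "", regex=True)
    return out


df = pd.DataFrame(
    {
        "name": ["Renée", "José?"],
        "timestamp": ["2021-01-01 10:00?", "2021-01-02 11:30"],
    }
)
result = strip_characters(df, ["é", "?"], ignore_columns=["timestamp"])
print(result)
```

Here the "name" column loses its "é" and "?" characters, while "timestamp" is returned exactly as it came in.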
To set the row and column thresholds (each between 0 and 100):
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv', row_threshold=50, col_threshold=90)
cleaned_df = dp.start_cleaning()
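This README does not spell out what the thresholds measure. A common convention, assumed here purely for illustration, is that each threshold is the maximum percentage of missing values a row or column may have before it is dropped. The sketch below implements that assumption in pandas; `drop_sparse` is a hypothetical helper, and the package may define its thresholds differently.

```python
import pandas as pd


def drop_sparse(df, row_threshold=None, col_threshold=None):
    """Hypothetical helper: drop columns, then rows, whose share of
    missing values (in percent) exceeds the given threshold."""
    out = df
    if col_threshold is not None:
        col_missing = out.isna().mean() * 100  # percent missing per column
        out = out.loc[:, col_missing <= col_threshold]
    if row_threshold is not None:
        row_missing = out.isna().mean(axis=1) * 100  # percent missing per row
        out = out.loc[row_missing <= row_threshold]
    return out


df = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, 4.0],
        "b": [None, None, None, 1.0],
        "c": ["x", None, "z", "w"],
    }
)
trimmed = drop_sparse(df, row_threshold=50, col_threshold=70)
print(trimmed)
```

In this example column "b" is 75% missing and is dropped first; the second row is then entirely missing in the remaining columns and is dropped as well.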
The supported imputation types are:
- RDF (random forest), the default
- KNN (k-nearest neighbors)
- mean
- median
- most_frequent
- constant
# Example for KNN imputation.
from datacleaning import DataCleaning
dp = DataCleaning(file_upload='filename.csv', imputation_type='KNN')
cleaned_df = dp.start_cleaning()
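For intuition, the four simpler strategies can be sketched with pandas alone. `simple_impute` is a hypothetical helper, not the package's code; the model-based options (RDF and KNN) are not covered by this sketch. scikit-learn's `SimpleImputer` accepts strategies with these same four names and `KNNImputer` provides nearest-neighbor imputation, although whether this package relies on them is not stated here.

```python
import pandas as pd


def simple_impute(series: pd.Series, strategy: str, fill_value=None) -> pd.Series:
    """Hypothetical helper: fill missing values using one simple strategy."""
    if strategy == "mean":
        return series.fillna(series.mean())
    if strategy == "median":
        return series.fillna(series.median())
    if strategy == "most_frequent":
        # mode() ignores missing values; take the first if there are ties.
        return series.fillna(series.mode().iloc[0])
    if strategy == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required for the constant strategy")
        return series.fillna(fill_value)
    raise ValueError(f"unsupported strategy: {strategy}")


ages = pd.Series([20.0, None, 30.0, 70.0, 30.0])
by_mean = simple_impute(ages, "mean")
by_median = simple_impute(ages, "median")
by_mode = simple_impute(ages, "most_frequent")
by_constant = simple_impute(ages, "constant", fill_value=-1.0)
print(by_mean.tolist())
```

The mean is pulled upward by the outlier 70, while the median and most frequent value both stay at 30, which is one reason the choice of strategy matters for skewed columns.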