DATA CLEANING

## Description

In any machine learning process, data preprocessing is the primary step, in which raw, unclean data is transformed into cleaned data so that machine learning algorithms can be applied at a later stage. This Python package makes data preprocessing easy in just 2 lines of code. All you have to do is provide the raw data (a CSV file); the library cleans it and returns a cleaned dataframe, on which you can then apply feature engineering, feature selection, and modeling.

What does this do?
- Cleans special characters
- Removes duplicates
- Fixes abnormalities in column names
- Imputes the data (categorical & numerical)

Data Cleaning

Data-cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned dataframe. It handles imputation, duplicate removal, special-character replacement, and more.

How to use:

Step 1: Install the library

pip install data-cleaning

Step 2:

Import the library and specify the path of the CSV file.

from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv')
cleaned_df = dp.start_cleaning()

There are some optional parameters that you can specify, as listed below.

Usage:

from datacleaning import DataCleaning

DataCleaning(file_upload='filename.csv', separator=",", row_threshold=None, col_threshold=None,
         special_character=None, action=None, ignore_columns=None, imputation_type="RDF")

Parameters

| Parameter | Default Value | Limit | Example |
|---|---|---|---|
| file_upload | none | Provide a CSV file. | filename.csv |
| separator | , | Separator used in the CSV file | ; |
| row_threshold | none | 0 to 100 | 80 |
| col_threshold | none | 0 to 100 | 80 |
| special_character | Check the list below | Specify characters that are not listed in default_list (see below) | [ '$' , '?' ] |
| action | none | add or remove | add |
| ignore_columns | none | Provide a list of column names to exclude from the special characters operation. | [ 'column1', 'column2' ] |
| imputation_type | RDF | Select your preferred imputation: RDF, KNN, mean, median, most_frequent, constant. | KNN |

Examples of using parameters

- Appending extra special characters to the existing default_list

The default special characters included in the package are shown below:

default_list = ["!", '"', "#", "%", "&", "'", "(", ")",
                  "*", "+", ",", "-", ".", "/", ":", ";", "<",
                  "=", ">", "?", "@", "[", "\\", "]", "^", "_",
                  "`", "{", "|", "}", "~", "–", "//", "%*", ":/", ".;", "Ø", "§",'$',"£"]

To remove special characters, for example "?" and "%", pass them in special_character.

Note: Do not forget to set action='remove'.

from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv', special_character =['?', '%'], action='remove')
cleaned_df = dp.start_cleaning()

To add a special character that is not in the default_list given above, for example "é", pass it in special_character.

Note: Do not forget to set action='add'.

from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv', special_character =['é'], action='add')
cleaned_df = dp.start_cleaning()
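
The effect of `action` on the character list can be pictured with a small standalone sketch. It assumes that `'add'` appends the given characters to `default_list` and `'remove'` takes them out of it. This is one reading of the parameters, not the package's actual code, and `effective_characters` is a hypothetical helper name. The list is shortened here for brevity.

```python
# Illustrative sketch only: a hypothetical helper, not part of data-cleaning.
# Assumption: action='add' extends the default list, action='remove' shrinks it.

DEFAULT_LIST = ["!", "#", "%", "&", "?", "@", "$", "£"]  # shortened copy


def effective_characters(special_character=None, action=None, default=DEFAULT_LIST):
    """Return the list of characters that would be treated as special."""
    chars = list(default)  # copy so the default list is never mutated
    if not special_character:
        return chars
    if action == "add":
        # Append only characters that are not already present.
        for c in special_character:
            if c not in chars:
                chars.append(c)
    elif action == "remove":
        chars = [c for c in chars if c not in special_character]
    else:
        raise ValueError("action must be 'add' or 'remove'")
    return chars


print(effective_characters(["?", "%"], "remove"))
print(effective_characters(["é"], "add"))
```

Raising an error when `special_character` is given without a valid `action` mirrors the notes above about not forgetting `action`; whether the package itself raises in that case is not stated.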

- Ignoring particular columns and adding a special character

For example, say the columns "timestamp" and "date" need to be ignored and the special character 'é' needs to be added:

from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv', special_character =['é'],
              action='add', ignore_columns=['timestamp', 'date'])
cleaned_df = dp.start_cleaning()
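
To see why `ignore_columns` matters for columns such as "timestamp" and "date", here is a minimal standalone sketch. It assumes that special characters are replaced with an empty string and that ignored columns are left untouched. `strip_special_characters` is a hypothetical helper written for this illustration, not the package's implementation.

```python
# Illustrative sketch only: hypothetical helper, not the package's implementation.
# Assumption: columns named in ignore_columns keep their values untouched, and
# special characters are replaced with an empty string.


def strip_special_characters(rows, characters, ignore_columns=None):
    """Remove every listed character from string cells, skipping ignored columns."""
    ignored = set(ignore_columns or [])
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in ignored or not isinstance(value, str):
                new_row[column] = value
                continue
            for ch in characters:
                value = value.replace(ch, "")
            new_row[column] = value
        cleaned.append(new_row)
    return cleaned


rows = [
    {"name": "Zoé?", "date": "2021-03-04", "timestamp": "10:15:00"},
    {"name": "#Ana", "date": "2021-03-05", "timestamp": "11:30:00"},
]
result = strip_special_characters(
    rows, ["?", "#", "-", ":", "é"], ignore_columns=["timestamp", "date"]
)
print(result)
```

Without `ignore_columns`, the "-" and ":" characters would be stripped from the date and time values as well, which is usually not what you want for such columns.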

- Changing the thresholds to remove null rows/columns above the given threshold value

from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv', row_threshold=50, col_threshold=90)
cleaned_df = dp.start_cleaning()
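
One way to picture `row_threshold=50` and `col_threshold=90` is sketched below. It assumes each threshold is the maximum allowed percentage of missing values, and that columns are filtered before rows. Both points are assumptions for illustration, and `drop_by_null_threshold` is a hypothetical helper, not the package's code.

```python
# Illustrative sketch only: hypothetical helper, not the package's implementation.
# Assumption: a threshold is the maximum allowed percentage of missing values.


def drop_by_null_threshold(rows, row_threshold=None, col_threshold=None):
    """Drop columns, then rows, whose share of missing values exceeds the threshold."""
    if not rows:
        return []
    columns = list(rows[0].keys())

    def missing_pct(values):
        values = list(values)
        return 100.0 * sum(v is None for v in values) / len(values)

    if col_threshold is not None:
        columns = [c for c in columns
                   if missing_pct(r[c] for r in rows) <= col_threshold]
    rows = [{c: r[c] for c in columns} for r in rows]
    if row_threshold is not None and columns:
        rows = [r for r in rows if missing_pct(r.values()) <= row_threshold]
    return rows


data = [
    {"a": 1, "b": None, "c": None},
    {"a": 2, "b": None, "c": 5},
    {"a": None, "b": None, "c": None},
    {"a": 4, "b": None, "c": 7},
]
kept = drop_by_null_threshold(data, row_threshold=50, col_threshold=90)
print(kept)
```

In this sample, column "b" is entirely missing and is dropped, and the third row is dropped because all of its remaining values are missing.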

- Imputation methods available

  - RDF (Random Forest), the default
  - KNN (k-nearest neighbors)
  - mean
  - median
  - most_frequent
  - constant

# Example for KNN imputation.
from datacleaning import DataCleaning

dp = DataCleaning(file_upload='filename.csv', imputation_type='KNN')
cleaned_df = dp.start_cleaning()
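
For intuition about the simpler strategies, the sketch below shows what mean, median, most_frequent, and constant compute for a single column. It is a standalone illustration using only the standard library; it is not the package's code, it does not cover RDF or KNN, and `impute_column` and its `fill_value` parameter are hypothetical names. The strategy names match those of scikit-learn's `SimpleImputer`, though whether the package uses it internally is not stated here.

```python
# Illustrative sketch only: shows what the simple strategies compute for one
# column; it is not the package's code and does not cover RDF or KNN.
from collections import Counter
from statistics import mean, median


def impute_column(values, strategy="mean", fill_value=0):
    """Replace None entries using one of the simple imputation strategies."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    elif strategy == "constant":
        fill = fill_value  # hypothetical parameter for this sketch
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 30, None, 40]
print(impute_column(ages, "mean"))
print(impute_column(ages, "median"))
print(impute_column(ages, "most_frequent"))
print(impute_column(ages, "constant", fill_value=-1))
```

Mean and median only make sense for numerical columns, while most_frequent and constant also work for categorical ones.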
