One step ETL python package for basic data manipulation.

FlintyTub49/MAHA

MAHA

MAHA is an in-progress ETL package that uses machine learning to clean your dataset with a single command. Its features include:

  • Dropping all index columns
  • Dropping columns with too many missing values
  • Using regression to predict the missing values in the data and fill them in
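
The three features above can be illustrated with plain pandas and scikit-learn. This is only a minimal sketch, not MAHA's own code: the function names (`drop_index_columns`, `drop_sparse_columns`, `regression_impute`), the 50% missing-value threshold, and the choice of `LinearRegression` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def drop_index_columns(df):
    """Drop integer columns that only repeat the row number (0..n-1 or 1..n).

    Hypothetical heuristic: a genuine feature that happens to count
    upwards row by row would be dropped as well.
    """
    n = len(df)
    to_drop = []
    for col in df.columns:
        values = df[col]
        if values.isna().any() or not pd.api.types.is_integer_dtype(values):
            continue
        arr = values.to_numpy()
        if np.array_equal(arr, np.arange(n)) or np.array_equal(arr, np.arange(1, n + 1)):
            to_drop.append(col)
    return df.drop(columns=to_drop)


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing (assumed threshold)."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]


def regression_impute(df, target):
    """Fill missing values in `target` with predictions from the other columns."""
    out = df.copy()
    missing = out[target].isna()
    if not missing.any() or missing.all():
        return out  # nothing to fill, or nothing to learn from
    # Predictors are all other columns; their own gaps are filled with the
    # column mean so that the model can be fitted.
    predictors = out.drop(columns=[target])
    predictors = predictors.fillna(predictors.mean())
    model = LinearRegression().fit(predictors[~missing], out.loc[~missing, target])
    out.loc[missing, target] = model.predict(predictors[missing])
    return out


df = pd.DataFrame({
    "idx": range(6),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0, 13.0],
    "mostly_empty": [np.nan] * 5 + [1.0],
})
cleaned = drop_sparse_columns(drop_index_columns(df))
filled = regression_impute(cleaned, "y")
```

Here `idx` is removed as an index column, `mostly_empty` is removed because five of its six values are missing, and the gap in `y` is filled from a regression on `x`.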

Prerequisites

  • Data is in pandas DataFrame format
  • All the categorical variables are label encoded
  • All the columns are in the desired data type of the output

You can also:

  • Find the mean and mode of every column
  • Fill the NA values with the mean or mode of each column, depending on its data type
  • Fit a model for every column, with all other columns as the independent variables
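
The mean and mode helpers listed above could look roughly like the following in plain pandas. This is a hedged sketch rather than MAHA's API: `column_summaries` and `fill_mean_mode` are hypothetical names, and the rule "float columns get the mean, every other column gets the mode" is an assumption about what "depending on the data type" means.

```python
import pandas as pd


def column_summaries(df):
    """Return (means, modes): the mean of numeric columns and the mode of every column."""
    means = df.mean(numeric_only=True)
    modes = {}
    for col in df.columns:
        mode = df[col].mode(dropna=True)
        # A column with no observed values has no mode.
        modes[col] = mode.iloc[0] if not mode.empty else None
    return means, modes


def fill_mean_mode(df):
    """Fill NA values: mean for float columns, mode for all other columns (assumed rule)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


df = pd.DataFrame({
    "height": [1.0, None, 3.0],
    # Label-encoded categorical column stored as a nullable integer
    "label": pd.array([0, 0, None], dtype="Int64"),
})
means, modes = column_summaries(df)
filled = fill_mean_mode(df)
```

Storing label-encoded columns as nullable `Int64` keeps them from being converted to float when they contain missing values, so the dtype check can tell them apart from continuous columns.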

Dependencies

MAHA uses a number of open source projects to work properly:

  • NumPy - NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • Pandas - Pandas is a software library written for the Python programming language for data manipulation and analysis.
  • scikit-learn (sklearn) - A machine learning library for Python that includes various classification, regression and clustering algorithms.

Installation

MAHA requires pandas, NumPy and scikit-learn.

Use pip to install the packages:

$ pip3 install pandas
$ pip3 install numpy
$ pip3 install scikit-learn

If you have not installed pip, you can download its installer with:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Then run the following command in the directory where you downloaded get-pip.py:

$ python get-pip.py

Development

Developed by: Mithesh R, Arth Akhouri, Heetansh Jhaveri, Ayaan Khan

Want to contribute? See the MAHA GitHub repository for more information.

License

MIT
