Data: The dataset contains 80 input features and one target variable, SalePrice. The goal is to predict the sale price of each house from features such as MSSubClass, LotFrontage, and LotArea.
This is a regression problem. You can import the dataset from the following link to replicate the results and follow along with the experiment. We'll use XGBoost to solve this problem.
Dependencies: You'll need to install the dependencies below to run this project.
- json: 2.0.9 (part of the Python standard library, so no separate installation is needed)
- pandas: 1.0.1
- numpy: 1.18.1
- matplotlib: 3.5.3
- seaborn: 0.10.0
- sklearn (installed as scikit-learn): 0.22.1
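To check an environment against the versions listed above, something like the following sketch may help. It is not part of the project: the `PINNED` dictionary and the `compare_versions` helper are hypothetical names introduced here, and the sketch assumes Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata

# Pinned versions copied from the list above. "scikit-learn" is the
# distribution name behind the `sklearn` import; json is omitted because
# it ships with Python.
PINNED = {
    "pandas": "1.0.1",
    "numpy": "1.18.1",
    "matplotlib": "3.5.3",
    "seaborn": "0.10.0",
    "scikit-learn": "0.22.1",
}


def compare_versions(pinned):
    """Map each package to (pinned version, installed version or None)."""
    result = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        result[name] = (wanted, installed)
    return result


for name, (wanted, installed) in compare_versions(PINNED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: pinned {wanted}, installed {installed} ({status})")
```

A "differs" line does not necessarily mean the project will fail; it only flags that the environment is not the one the code was tested with.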
The code has been tested on Windows. It should work on other operating systems, but this has not yet been tested.
If you run into any issues with installation or otherwise, please contact me on LinkedIn.
- Write a reusable function to determine the data type, null count, unique values, and null percentage of each variable and store the results in a dataframe.
- Feature Engineering.
- Easy method to check Null values across different features in dataset.
- Encode rare categories using RareLabelEncoder.
- Create a class for temporal transformation that is compatible with the sklearn pipeline.
- Build the sklearn pre-processing pipeline for steps such as missing value imputation, feature engineering, and data encoding.
- Calculate the feature importance.
- Automatic important feature selection using SelectFromModel.
- Compare different model versions: a model without preprocessed data, a model with processed data, and a model with important variables only.
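As an illustration of the first item in the list, here is a minimal sketch of a reusable summary function. It is not the project's actual code: the function name `summarize_variables`, the output column names, and the tiny demo frame are assumptions made for this example, and only three of the dataset's column names are borrowed.

```python
import numpy as np
import pandas as pd


def summarize_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, unique count, null percent."""
    summary = pd.DataFrame({
        "Data_Type": df.dtypes.astype(str),
        "Null_Count": df.isnull().sum(),
        "Unique_Values": df.nunique(),  # NaN is not counted as a unique value
        "Null_Percent": (df.isnull().mean() * 100).round(2),
    })
    # Put the columns with the most missing values first.
    return summary.sort_values("Null_Percent", ascending=False)


# Tiny made-up frame (hypothetical values) using three real column names.
demo = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = summarize_variables(demo)
print(report)
```

Because the result is itself a dataframe indexed by column name, it can be filtered directly, for example `report[report["Null_Percent"] > 0]` to list only the variables that need imputation.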
If you have a Data Science mini-project that you'd like to share, please follow the guidelines in CONTRIBUTING.md.
Please adhere to our Code of Conduct in all your interactions with the project.
This project is licensed under the MIT License.
For questions or inquiries, feel free to contact me on LinkedIn.
I’m a seasoned Data Scientist and founder of TowardsMachineLearning.Org. I've worked on various Machine Learning, NLP, and cutting-edge deep learning frameworks to solve numerous business problems.