
Predicting survey responses

Machine learning project (2021-09 ~ 12)

This repository contains a project I did for a machine learning class at school. It includes several Python files; the final project is Predicting survey responses. The project data is not publicly available because it is real data used in the field.

The project in this repository is divided into five main parts.
  1. Model hyperparameter tuning
  2. Imbalanced data handling
  3. Submission ensemble
  4. Machine learning pipelines
  5. Predicting survey responses

I took this course in the second semester of my sophomore year and worked hard on it for the whole semester. The last project (the fifth) in particular was a Kaggle-style competition, and I tried various approaches to improve performance.

Visualization

Feature importance visualization using the SHAP module.

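A minimal sketch of how such a plot can be produced (synthetic data and a generic scikit-learn model here; the project's actual model and data are not public):

```python
# Minimal sketch (assumed setup, not the project's exact code): SHAP
# feature-importance summary plot for a fitted tree-based model.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast exact SHAP for tree ensembles
shap_values = explainer.shap_values(X)  # one attribution per sample and feature
shap.summary_plot(shap_values, X)       # beeswarm summary of importances
```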

Heatmap of the correlations between model predictions.

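A minimal sketch of this kind of plot, with hypothetical prediction columns standing in for the real model outputs:

```python
# Minimal sketch (hypothetical prediction arrays): heatmap of pairwise
# correlations between model predictions, useful for picking diverse
# ensemble members.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

preds = pd.DataFrame({                 # columns = per-model predicted scores
    "lgbm": np.random.rand(100),
    "xgb": np.random.rand(100),
    "catboost": np.random.rand(100),
})
sns.heatmap(preds.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between model predictions")
plt.show()
```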

Comparison of model scores after the model ensemble.


Project Description

Projects

    In this section, we'll discuss five different projects.

    1. Model hyperparameter tuning - Using two ongoing competitions on Kaggle, Predict Future Sales (link : https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales) and Categorical Feature Encoding Challenge (link : https://www.kaggle.com/competitions/cat-in-the-dat), we tuned model hyperparameters to improve performance.
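For illustration, a minimal hyperparameter-tuning sketch with scikit-learn's RandomizedSearchCV on synthetic data (the competitions' actual features and models are not reproduced here):

```python
# Minimal sketch (assumed setup): randomized hyperparameter search with
# cross-validation, one common way to tune a model on Kaggle data.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=42)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(100, 500),   # sample tree counts in [100, 500)
        "max_depth": randint(3, 12),         # sample depths in [3, 12)
    },
    n_iter=20, cv=5, scoring="roc_auc", random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```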

    2. Imbalanced Data Processing - Using the WSDM - KKBox's Music Recommendation Challenge ( link : https://www.kaggle.com/competitions/kkbox-music-recommendation-challenge ) dataset released on Kaggle, we experimented with the imbalanced-data processing techniques presented in the paper Survey of resampling techniques for improving classification performance in unbalanced datasets ( link : https://arxiv.org/abs/1608.06048 ). For the experiments, we constructed three situations by rebuilding the dataset with class ratios of 10:90, 1:99, and 0.1:99.9. We used Accuracy, ROC-AUC, Recall, Precision, and F1-score as evaluation metrics. The results show that imbalanced-data processing improves performance on the relatively small (minority) class even when overall accuracy stays about the same, so it should be used when discovering the minority class matters. I also learned that imbalanced-data processing techniques fall into under-sampling, over-sampling, combined, and ensemble methods. You can see the details in the pptx file.
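A minimal sketch of one such resampling experiment, using SMOTE over-sampling from the imbalanced-learn library on a synthetic roughly 1:99 dataset (an assumed stand-in, not the project's exact code):

```python
# Minimal sketch (assumed setup): over-sample a ~1:99 imbalanced dataset with
# SMOTE, then check minority-class recall/precision/F1 on a held-out split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)  # balance classes
clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```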

    3. Submissions Ensemble - Using two ongoing competitions on Kaggle, Predict Future Sales (link : https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales) and Categorical Feature Encoding Challenge (link : https://www.kaggle.com/competitions/cat-in-the-dat), we performed ensembles of various submissions to improve performance. We used the arithmetic mean, weighted mean, geometric mean, and median for ensembling submissions, as well as several combinations of them. We compared the results of ensembling submissions with similar performance against ensembling submissions with low correlation. You can see the detailed experimental table in Submission Ensemble.pptx.
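A minimal sketch of these blending rules over hypothetical submission files (file and column names are placeholders):

```python
# Minimal sketch (placeholder file names): blend submission files with the
# arithmetic mean, a weighted mean, the geometric mean, and the median.
import numpy as np
import pandas as pd
from scipy.stats import gmean

subs = [pd.read_csv(f)["target"].to_numpy()
        for f in ["sub_a.csv", "sub_b.csv", "sub_c.csv"]]
stacked = np.vstack(subs)                       # shape: (n_models, n_rows)

blends = {
    "arithmetic": stacked.mean(axis=0),
    "weighted":   np.average(stacked, axis=0, weights=[0.5, 0.3, 0.2]),
    "geometric":  gmean(stacked, axis=0),
    "median":     np.median(stacked, axis=0),
}
for name, pred in blends.items():
    pd.DataFrame({"target": pred}).to_csv(f"ensemble_{name}.csv", index=False)
```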

    4. Model pipeline - Analyze the attributes of a department store customer's purchases over the course of a year to predict the customer's gender (0: female, 1: male).

    Build an overall machine learning pipeline.

    1. Read Data
    2. EDA (identify missing values, data attributes, and similarities)
    3. Data processing (missing-value handling, outlier handling, categorical-variable encoding, feature selection)
    4. Model Hyperparameter Optimization
    • Models used: LogisticRegression, KNN, SupportVectorMachine, ExtraTrees, MLP
    5. Model Ensemble
    • Try all combinations and pick the best-performing one.
    • Finally, stack all the models above and select the best-performing combination.
    • Once again, we proceed with the submission ensemble.
    • Model Pipeline Strategy: The strategy we envisioned is as follows. The traditional approach is to fix one dataset based on the feature-selection results of a single model and then proceed with the rest of the pipeline.
      -> I realized that the important features might differ from model to model, so I performed feature selection separately for each of the models above. Done per model, the selected features differ slightly, so we end up with as many different train datasets as there are models. Each model trains on its own dataset and makes its own predictions. Because the datasets and models differ, the predictions are less correlated than before, and each model is expected to perform better since it uses its own optimal feature set. Finally, we ensemble over all combinations of the prediction results, stack the results that are least correlated with each other, and run the submission ensemble once more to generate the final submission (a sketch of this per-model feature selection with stacking follows).
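A rough sketch of this strategy on synthetic data, using scikit-learn's SelectFromModel and StackingClassifier (the project's actual models and data differ):

```python
# Rough sketch (synthetic data): each base model gets its own feature-selection
# step, so every model effectively trains on its own dataset; the per-model
# predictions are then combined by stacking.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)

base_models = [
    ("logreg", make_pipeline(SelectFromModel(LogisticRegression(max_iter=1000)),
                             LogisticRegression(max_iter=1000))),
    ("extra",  make_pipeline(SelectFromModel(ExtraTreesClassifier(random_state=42)),
                             ExtraTreesClassifier(random_state=42))),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X, y)
print(stack.score(X, y))
```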

    5. Predicting survey responses - This is the final project, Survey Response Prediction.

    • Setup : Due to limited computing resources, we first convert the data types of the values in the dataset to reduce memory usage. We also fix the random seed so that results do not vary from run to run.
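A minimal sketch of both setup steps (the helper name is ours, not necessarily the project's):

```python
# Minimal sketch: downcast numeric columns to smaller dtypes to reduce memory,
# and fix random seeds for run-to-run reproducibility.
import random

import numpy as np
import pandas as pd

def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast each numeric column to the smallest dtype that holds its values.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
```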

    • Before making predictions, we needed to think about a few things. This is real-world data collected in the field, so it contains many errors. We therefore remove data that will not help with prediction, for example, columns with more than 30% missing values. Also, the data has a time-series nature, so using future information can cause the model to learn incorrectly and lead to overfitting.
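A minimal sketch of the missing-value filter described above (the file name is a placeholder):

```python
# Minimal sketch (placeholder file name): drop columns whose missing-value
# ratio exceeds 30%, as described above.
import pandas as pd

df = pd.read_csv("survey.csv")            # placeholder for the project data
missing_ratio = df.isna().mean()          # fraction of missing values per column
df = df.loc[:, missing_ratio <= 0.30]     # keep columns with <= 30% missing
```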

    1. Data processing: Remove outliers and missing values. Apply one-hot encoding to categorical features and standard scaling to numerical features.
    2. Create new features: Add features based on various ideas to improve the prediction results.
    3. Feature selection
    4. Divide the dataset: 6:4 and 7:3 splits for the experiments.
    5. Create models
    • Models used: DNN, LGBM, LogisticRegression, CatBoost, AdaBoost, XGB, RandomForest
    6. Model Ensemble, Bagging
    • We ensemble the models and bag some of them. Due to the nature of this data there was a lot of overfitting; bagging compensated for it and gave a large performance improvement (see the sketch after this list).
    7. Submission ensemble: We ensemble the finalized submissions in various combinations, expecting improved performance.
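A minimal sketch combining the preprocessing and bagging steps above (hypothetical columns and a generic base model; the project's actual features and models differ):

```python
# Minimal sketch (hypothetical column names, not the project's exact code):
# one-hot encode categorical features, standard-scale numerical ones, and wrap
# a base model in bagging, which the project used to damp overfitting.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                       # stand-in for the survey data
    "job": ["a", "b", "a", "c"] * 25,
    "age": range(100),
    "income": [i * 10 for i in range(100)],
    "target": [0, 1] * 50,
})
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ("num", StandardScaler(), ["age", "income"]),
])
model = make_pipeline(
    preprocess,
    BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                      n_estimators=10, random_state=42),
)
model.fit(df[["job", "age", "income"]], df["target"])
print(model.score(df[["job", "age", "income"]], df["target"]))
```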
