
The Data Moguls

Data Analytics Project 2019: Microsoft Malware Prediction

  • Step 1: Pre-Processing {Refer to file Final/TheDataMoguls_Stage1.ipynb}
    This file contains code for
    1. preprocessing
    2. model fitting
    3. the LGBM classifier, which obtained an AUC of 0.72 (see the LightGBM sketch after this list)
    4. feature extraction
    5. visualization.

    Since the dataset is large, running this file takes around 15 minutes.
  • Step 2: Data Preparation {Refer to file Final/TheDataMoguls_DataPreparation.ipynb}
    This file contains code for merging the original dataset with the external data provided in the Kaggle notebook at https://www.kaggle.com/cdeotte/external-data-malware-0-50 (a merging sketch appears after this list).
  • Step 3: LSTM Classifier {Refer to file Final/TheDataMoguls_LSTM.ipynb}
    This file contains code for the LSTM model applied to the combined dataset. The accuracy obtained was 50% (an LSTM sketch appears after this list).
  • Step 4: AdaBoost Classifier {Refer to file Final/TheDataMoguls_Adaboost.ipynb}
    This file contains code for the AdaBoost classifier applied to the combined dataset. The accuracy obtained was 55% (an AdaBoost sketch appears after this list).
  • Step 5: LightGBM Classifier {Refer to file Final/TheDataMoguls_LightGBM.ipynb}
    The LightGBM model was applied to the combined data. The AUC obtained was 0.57 (see the LightGBM sketch below).
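The sketches below are minimal, hedged illustrations of the steps above, not the exact notebook code: file paths such as train.csv and combined_train.csv, parameter values, and the train/validation split are assumptions, while MachineIdentifier and HasDetections are the ID and target columns of the Kaggle competition data. This first sketch corresponds to Steps 1 and 5: fitting a LightGBM classifier and reporting a validation AUC.

```python
# Minimal LightGBM baseline sketch (Steps 1 and 5). File path, parameters,
# and the split are illustrative; the notebooks may differ in the details.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the Kaggle training data (path is an assumption).
df = pd.read_csv("train.csv")

# Treat object columns as pandas categoricals so LightGBM can use them directly.
for col in df.select_dtypes(include="object").columns:
    if col != "MachineIdentifier":
        df[col] = df[col].astype("category")

X = df.drop(columns=["HasDetections", "MachineIdentifier"])
y = df["HasDetections"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = lgb.LGBMClassifier(
    n_estimators=500, learning_rate=0.05, num_leaves=64, random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric="auc")

# Report validation AUC (the project obtained about 0.72 on the original data
# and 0.57 on the combined data).
pred = model.predict_proba(X_valid)[:, 1]
print("Validation AUC:", roc_auc_score(y_valid, pred))
```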
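A possible shape for Step 2's merge, assuming the external data is exported from the linked Kaggle notebook as a CSV (external_data.csv is a hypothetical name) and joined on a shared key such as AvSigVersion; the real key and columns come from that notebook's output.

```python
# Illustrative sketch of Step 2: joining external data onto the original
# training set. File names and the join key are assumptions.
import pandas as pd

train = pd.read_csv("train.csv")
external = pd.read_csv("external_data.csv")  # hypothetical export from the notebook

# Left-join so every original row is kept even without an external match.
combined = train.merge(external, on="AvSigVersion", how="left")

combined.to_csv("combined_train.csv", index=False)
print(combined.shape)
```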
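One way Step 3's LSTM could be applied to tabular rows is to treat each row as a length-1 sequence; this input arrangement and the encoding of categoricals are assumptions, and the notebook may do it differently.

```python
# Hedged sketch of Step 3: an LSTM over rows of the combined dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

df = pd.read_csv("combined_train.csv")  # output of the data-preparation step
y = df["HasDetections"].values
X = df.drop(columns=["HasDetections", "MachineIdentifier"])

# Simple integer encoding of categorical columns, for illustration only.
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes
X = X.fillna(-1)

X = StandardScaler().fit_transform(X)
X = X.reshape(X.shape[0], 1, X.shape[1])  # (samples, timesteps, features)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Sequential([
    Input(shape=(1, X.shape[2])),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
          epochs=5, batch_size=1024)
```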
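A sketch of Step 4's AdaBoost classifier using scikit-learn, again with a simple integer encoding of categorical columns (an illustrative choice, since AdaBoostClassifier requires numeric inputs).

```python
# Hedged sketch of Step 4: AdaBoost on the combined data.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("combined_train.csv")
y = df["HasDetections"]
X = df.drop(columns=["HasDetections", "MachineIdentifier"])
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes
X = X.fillna(-1)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
clf.fit(X_train, y_train)

# The project reports roughly 55% accuracy on the combined data.
print("Validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))
```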
Conclusion

The AdaBoost classifier and the LSTM do not give good accuracy on the combined data, and changing their parameters does not affect the results. The low accuracy indicates that merging the datasets may not have captured the essence of the actual classification problem.
LightGBM does not require categorical features to be encoded, suits time-dependent classification well, and is fast when applied to large datasets.
