# Microsoft Malware Prediction

**Aim -** Can you predict if a machine will soon be hit with malware? Help protect 1 billion machines from damage before it happens.

**Summary -** The goal of this competition is to predict a Windows machine's probability of getting infected by various families of malware, based on different properties of that machine.


## About the dataset

Contains 2 files - 
* train.csv
* test.csv

**Variables/Columns present -**

Note - Unavailable or self-documenting column names are marked with an "NA".

* `MachineIdentifier` - Individual machine ID
* `ProductName` - Defender state information e.g. win8defender
* `EngineVersion` - Defender state information e.g. 1.1.12603.0
* `AppVersion` - Defender state information e.g. 4.9.10586.0
* `AvSigVersion` - Defender state information e.g. 1.217.1014.0
* `IsBeta` - Defender state information e.g. false
* `RtpStateBitfield` - NA
* `IsSxsPassiveMode` - NA
* `DefaultBrowsersIdentifier` - ID for the machine's default browser
* `AVProductStatesIdentifier` - ID for the specific configuration of a user's antivirus software
* `AVProductsInstalled` - NA
* `AVProductsEnabled` - NA
* `HasTpm` - True if machine has tpm
* `CountryIdentifier` - ID for the country the machine is located in
* `CityIdentifier` - ID for the city the machine is located in
* `OrganizationIdentifier` - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
* `GeoNameIdentifier` - ID for the geographic region a machine is located in
* `LocaleEnglishNameIdentifier` - English name of Locale ID of the current user
* `Platform` - Calculates platform name (of OS related properties and processor property)
* `Processor` - This is the process architecture of the installed operating system
* `OsVer` - Version of the current operating system
* `OsBuild` - Build of the current operating system
* `OsSuite` - Product suite mask for the current operating system.
* `OsPlatformSubRelease` - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
* `OsBuildLab` - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
* `SkuEdition` - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must handle each class.
* `IsProtected` - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.
* `AutoSampleOptIn` - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
* `PuaMode` - Pua Enabled mode from the service
* `SMode` - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
* `IeVerIdentifier` - NA
* `SmartScreen` - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
* `Firewall` - This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.
* `UacLuaenable` - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
* `Census_MDC2FormFactor` - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible…)
* `Census_DeviceFamily` - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
* `Census_OEMNameIdentifier` - NA
* `Census_OEMModelIdentifier` - NA
* `Census_ProcessorCoreCount` - Number of logical cores in the processor
* `Census_ProcessorManufacturerIdentifier` - NA
* `Census_ProcessorModelIdentifier` - NA
* `Census_ProcessorClass` - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
* `Census_PrimaryDiskTotalCapacity` - Amount of disk space on primary disk of the machine in MB
* `Census_PrimaryDiskTypeName` - Friendly name of Primary Disk Type - HDD or SSD
* `Census_SystemVolumeTotalCapacity` - The size of the partition that the System volume is installed on in MB
* `Census_HasOpticalDiskDrive` - True indicates that the machine has an optical disk drive (CD/DVD)
* `Census_TotalPhysicalRAM` - Retrieves the physical RAM in MB
* `Census_ChassisTypeName` - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
* `Census_InternalPrimaryDiagonalDisplaySizeInInches` - Retrieves the physical diagonal length in inches of the primary display
* `Census_InternalPrimaryDisplayResolutionHorizontal` - Retrieves the number of pixels in the horizontal direction of the internal display.
* `Census_InternalPrimaryDisplayResolutionVertical` - Retrieves the number of pixels in the vertical direction of the internal display
* `Census_PowerPlatformRoleName` - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
* `Census_InternalBatteryType` - NA
* `Census_InternalBatteryNumberOfCharges` - NA
* `Census_OSVersion` - Numeric OS version Example - 10.0.10130.0
* `Census_OSArchitecture` - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
* `Census_OSBranch` - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where * OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005
* `Census_OSBuildNumber` - OS Build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
* `Census_OSBuildRevision` - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
* `Census_OSEdition` - Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise
* `Census_OSSkuName` - OS edition friendly name (currently Windows only)
* `Census_OSInstallTypeName` - Friendly description of what install was used on the machine i.e. clean
* `Census_OSInstallLanguageIdentifier` - NA
* `Census_OSUILocaleIdentifier` - NA
* `Census_OSWUAutoUpdateOptionsName` - Friendly name of the WindowsUpdate auto-update settings on the machine.
* `Census_IsPortableOperatingSystem` - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
* `Census_GenuineStateName` - Friendly name of OSGenuineStateID. 0 = Genuine
* `Census_ActivationChannel` - Retail license key or Volume license key for a machine.
* `Census_IsFlightingInternal` - NA
* `Census_IsFlightsDisabled` - Indicates if the machine is participating in flighting.
* `Census_FlightRing` - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.
* `Census_ThresholdOptIn` - NA
* `Census_FirmwareManufacturerIdentifier` - NA
* `Census_FirmwareVersionIdentifier` - NA
* `Census_IsSecureBootEnabled` - Indicates if Secure Boot mode is enabled.
* `Census_IsWIMBootEnabled` - NA
* `Census_IsVirtualDevice` - Identifies a Virtual Machine (machine learning model)
* `Census_IsTouchEnabled` - Is this a touch device?
* `Census_IsPenCapable` - Is the device capable of pen input?
* `Census_IsAlwaysOnAlwaysConnectedCapable` - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
* `Wdft_IsGamer` - Indicates whether the device is a gamer device or not based on its hardware combination.
* `Wdft_RegionIdentifier` - NA


**Getting rid of pesky warnings**

In [None]:
import warnings 
warnings.filterwarnings('ignore')

# Importing all libraries

For numeric and data handling - 

* NumPy
* pandas

For environmental uses - 

* Sys
* Os

For Data Preprocessing - 

* Ordinal Encoder
* Train_test_split

For Modelling - 

* LightGBM

For Parameter Selection - 

* GridSearchCV

For Plotting -

* Matplotlib
* Seaborn

In [None]:
import pandas as pd
import numpy as np
import sys
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns

**As this data is huge (roughly 8 GB), we will make use of several practices from this amazing post -** https://www.kaggle.com/code/pavansanagapati/14-simple-tips-to-save-ram-memory-for-1-gb-dataset

**Defining column data types**

In [None]:
column_data_types = {'IsBeta' : 'Int8',
                 'RtpStateBitfield' : 'Int16',
                 'IsSxsPassiveMode' : 'Int8',
                 'HasTpm' : 'Int8',
                 'CountryIdentifier' : 'Int64',
                 'CityIdentifier' : 'Int64',
                 'OrganizationIdentifier' : 'Int64',
                 'IsProtected' : 'Int8',
                 'AutoSampleOptIn' : 'Int8',
                 'SMode' : 'Int8',
                 'Firewall' : 'Int8',
                 'Census_HasOpticalDiskDrive' : 'Int8',
                 'Census_IsPortableOperatingSystem' : 'Int8',
                 'Census_IsFlightsDisabled' : 'Int8',
                 'Census_IsSecureBootEnabled' : 'Int8',
                 'Census_IsWIMBootEnabled' : 'Int8',
                 'Census_IsVirtualDevice' : 'Int8',
                 'Census_IsTouchEnabled' : 'Int8',
                 'Census_IsPenCapable' : 'Int8',
                 'Census_IsAlwaysOnAlwaysConnectedCapable': 'Int8',
                 'Wdft_IsGamer' : 'Int8',
                 'HasDetections' : 'int'}
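
As a rough illustration of why the nullable `Int8` type is worth using for flag columns, the sketch below compares the memory footprint of a synthetic 0/1 column with missing values stored as pandas' default `float64` versus `Int8`. The series `flags` is made-up data, not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Synthetic flag column with missing values (hypothetical data).
# pandas stores it as float64 by default because of the NaNs.
flags = pd.Series([1, 0, np.nan, 1] * 250_000)

# Nullable Int8 keeps the missing values as <NA>, using one data byte
# plus one mask byte per row.
compact = flags.astype('Int8')

default_bytes = flags.memory_usage(index=False)
compact_bytes = compact.memory_usage(index=False)

print(f"float64: {default_bytes / 1e6:.1f} MB")
print(f"Int8:    {compact_bytes / 1e6:.1f} MB")
```

The saving is per column, so it adds up quickly across the many boolean-like columns in this dataset.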

**Defining columns to ignore**

Columns are ignored on the basis of their share of null values and of the analysis linked at the end of the notebook.

In [None]:
columns_to_ignore = ('DefaultBrowsersIdentifier',  # 95.14% NA values
                     'PuaMode', # 99.97% NA values
                     'Census_ProcessorClass', # 99.59% NA values.
                     'Census_InternalBatteryType', # 71.05% NA values
                     'Census_IsFlightingInternal', #83.04% NA values
                     'Census_ThresholdOptIn', # 63.52% NA values
                     
                     # numerical features
                     'Census_PrimaryDiskTotalCapacity', 
                     'Census_SystemVolumeTotalCapacity', 
                     'Census_TotalPhysicalRAM',        
                     'Census_InternalPrimaryDisplayResolutionHorizontal',
                     'Census_InternalPrimaryDisplayResolutionVertical',
                     'Census_InternalPrimaryDiagonalDisplaySizeInInches',
                     'Census_InternalBatteryNumberOfCharges',
                     
                     'IsBeta', 
                     'AutoSampleOptIn', 
                     'UacLuaenable', 
                     'Census_IsWIMBootEnabled',
                     
                     'Census_FlightRing_not',
                     'Census_IsAlwaysOnAlwaysConnectedCapable',
                     'Census_IsSecureBootEnabled',
                     'Census_IsTouchEnabled',
                     'Census_IsVirtualDevice',
                     'SMode'
                    )
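
One way such a list could be derived is to compute the share of missing values per column and keep the names that exceed a cut-off. The sketch below does this on a tiny made-up frame; the helper `high_na_columns` and the `0.6` threshold are illustrative assumptions, not the rule used to build the tuple above.

```python
import numpy as np
import pandas as pd

def high_na_columns(df, threshold=0.6):
    """Return the names of columns whose share of missing values exceeds `threshold`."""
    na_share = df.isna().mean()          # fraction of NA values per column
    return na_share[na_share > threshold].index.tolist()

# Tiny made-up frame: 'b' is 75% missing, 'a' and 'c' are mostly complete.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': ['x', None, 'y', 'z'],
})

to_drop = high_na_columns(toy, threshold=0.6)
print(to_drop)
```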

**Defining the supervised column**

In [None]:
label_column = 'HasDetections'

## Loading the training dataset

The dataset is very large. If you try to load it in full, without `columns_to_ignore` and without defining the column data types, the kernel will most likely crash.

In [None]:
train_df = pd.read_csv('/kaggle/input/microsoft-malware-prediction/train.csv',
                       usecols = lambda x: x not in columns_to_ignore,
                       dtype = column_data_types)
train_df.head()
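
To show how the `usecols` callable and the `dtype` mapping interact, here is a minimal sketch that reads a three-row CSV from memory instead of the real file. The CSV text, `ignore` and `dtypes` are small stand-ins invented for the example.

```python
import io
import pandas as pd

# Made-up miniature of the training file (two data rows, one ignored column).
csv_text = (
    "MachineIdentifier,IsBeta,PuaMode,HasDetections\n"
    "a1,0,,1\n"
    "b2,,on,0\n"
)

ignore = ('PuaMode',)                                 # stand-in for columns_to_ignore
dtypes = {'IsBeta': 'Int8', 'HasDetections': 'int'}   # stand-in for column_data_types

# The callable is evaluated once per column name; columns for which it
# returns False are never parsed, so they cost no memory.
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=lambda name: name not in ignore,
                 dtype=dtypes)

print(df.dtypes)
print(df)
```

The missing `IsBeta` entry is kept as `<NA>` in the nullable `Int8` column rather than forcing the column to a float type.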

# Data Investigation

**We need to investigate missing values in the training dataframe**

Logic - 

Number of observations in the training dataframe = `8921483`

Therefore, 

`(1 / 8921483) * 100% = 1.12e-5%`

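One missing value therefore corresponds to about `1.12e-5%` of the rows, so printing the NA percentage with 5 digits after the decimal point is enough to reveal a column that has even a single missing value.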
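
The arithmetic above can be checked directly. This short sketch formats the share of a single missing row with five and with four decimal places, to show why five are needed.

```python
n_rows = 8921483                      # rows in the training dataframe
one_missing_pct = 1 / n_rows * 100    # share of a single missing value, in percent

five_decimals = f"{one_missing_pct:.5f}"
four_decimals = f"{one_missing_pct:.4f}"

print(five_decimals)  # 0.00001
print(four_decimals)  # 0.0000
```

With four decimals a column with one missing value would be indistinguishable from a complete column.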



In [None]:
for col in train_df.columns:
    print(f'"{col}" has {train_df[col].nunique()} unique values and {train_df[col].isna().sum() / len(train_df) * 100:.5f}% NA values.')

**To SMOTE or not to SMOTE**

* What is SMOTE-ing?

https://towardsdatascience.com/smote-fdce2f605729#:~:text=SMOTE%20stands%20for%20Synthetic%20Minority,imbalanced%20data%20in%20classification%20problems.

Investigating if the target column is skewed

In [None]:
sns.countplot(x = train_df[label_column], orient = "h")

**As the label column does not show signs of data skew, we do not need to SMOTE**
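
The count plot can be complemented with a numeric check of the class shares. The sketch below uses a hypothetical helper, `class_shares`, on a made-up label list; on the real data it would be called on `train_df[label_column]`.

```python
import pandas as pd

def class_shares(labels):
    """Return the fraction of rows belonging to each class, ordered by class."""
    return pd.Series(labels).value_counts(normalize=True).sort_index()

# Made-up labels for illustration only.
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]
shares = class_shares(toy_labels)
print(shares)
```

Shares close to 0.5 for both classes indicate a balanced target, in which case resampling techniques such as SMOTE are not needed.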

# Data Preprocessing

In [None]:
train_df.drop(columns = ['MachineIdentifier'], inplace = True)

**We define a custom pre-processing function to pre-process and take care of the null values in the categorical variables.**

In this function we make use of `sys.intern` command to reduce memory usage. This is explained well here - https://stackoverflow.com/questions/76104472/python-str-lower-causes-memory-leak
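
As a small illustration of what `sys.intern` does, the sketch below builds two equal strings at runtime and interns both. After interning, both names refer to one shared string object, which is why repeated category values take less memory.

```python
import sys

# Two equal strings built at runtime; they are usually separate objects.
a = ''.join(['win', 'dows10'])
b = ''.join(['win', 'dows10'])

# sys.intern returns the single canonical copy for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)

print(ia == ib)   # equal by value
print(ia is ib)   # and the very same object after interning
```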

In [None]:
def categorical_preprocessing(df):
    temp = df.copy()
    
    cols = temp.select_dtypes(include = [object]).columns.tolist()   
    temp[cols] = temp[cols].astype(str).apply(lambda x: x.str.lower().apply(sys.intern))
    
    os_build_lab_cat = 'OsBuildLab'
    if os_build_lab_cat in temp.columns:
        os_build_lab_df = temp[os_build_lab_cat].str.split(pat = '.', n = 5, expand = True)
        os_build_lab_df = os_build_lab_df.astype(str).apply(lambda x: x.str.lower().apply(sys.intern))
        os_build_lab_df = os_build_lab_df.add_prefix(os_build_lab_cat + '_')
        
        temp = pd.concat([temp, os_build_lab_df], axis = 1)
        temp = temp.drop(columns = os_build_lab_cat)
    
    smart_screen_cat = 'SmartScreen'
    if smart_screen_cat in temp.columns:
        temp.loc[temp[smart_screen_cat] == 'promt', smart_screen_cat] = 'prompt'
        temp.loc[temp[smart_screen_cat] == '00000000', smart_screen_cat] = '0'
        temp[smart_screen_cat] = temp[smart_screen_cat].astype(str).apply(sys.intern)
        
    disk_type_cat = 'Census_PrimaryDiskTypeName'
    if disk_type_cat in temp.columns:
        # Values were lower-cased above, so compare against lower-case names.
        disk_types = ['hdd', 'ssd']
        # Assign (=) rather than compare (==) so the replacement takes effect.
        temp.loc[~temp[disk_type_cat].isin(disk_types), disk_type_cat] = 'na'
        temp[disk_type_cat] = temp[disk_type_cat].astype(str).apply(sys.intern)
        
    role_name_cat = 'Census_PowerPlatformRoleName'
    if role_name_cat in temp.columns:
        # Missing values became the string 'nan' after astype(str) above.
        na_types = ['unspecified', 'unknown', 'nan']
        temp.loc[temp[role_name_cat].isin(na_types), role_name_cat] = 'na'
        temp[role_name_cat] = temp[role_name_cat].astype(str).apply(sys.intern)
    
    return temp

In [None]:
train_df = categorical_preprocessing(train_df)
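
To make the `OsBuildLab` step of the function concrete, this sketch splits the example value from the column description into its dot-separated parts and prefixes the resulting columns, using the same `str.split` and `add_prefix` calls. The one-row series is illustrative only.

```python
import pandas as pd

# Example value taken from the OsBuildLab column description.
build_labs = pd.Series(['9600.17630.amd64fre.winblue_r7.150109-2022'],
                       name='OsBuildLab')

# A single-character pattern is treated as a literal '.', not a regex.
parts = build_labs.str.split(pat='.', n=5, expand=True).add_prefix('OsBuildLab_')

print(parts.columns.tolist())
print(parts.iloc[0].tolist())
```

Each part (build number, revision, architecture, branch, timestamp) becomes its own categorical feature.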

**Checking for duplicate values in the training dataframe**

In [None]:
print("Duplicates in the training dataframe :",train_df.duplicated().sum())

**Dropping these duplicate values**

In [None]:
train_df.drop_duplicates(inplace = True)

**Extract the label column and drop it from the training dataframe.**

In [None]:
y = train_df.pop(label_column)

**Defining columns into 2 categories, string and not-string columns**

In [None]:
string_columns = train_df.select_dtypes(include = 'object').columns
not_string_columns = train_df.select_dtypes(exclude = 'object').columns

In [None]:
na_value = -1

train_df[string_columns] = train_df[string_columns].fillna('na')
train_df[not_string_columns] = train_df[not_string_columns].fillna(na_value)
train_df.isna().sum()

**Splitting the training dataframe**

We split the training dataframe into training and validation sets to help us build the model and then get the best model parameters using GridSearchCV.

`Split ratio = 80:20 :: Train:Val`

The model with the best parameters will then be used to predict on the test set.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train_df, y, test_size = 0.2, random_state = 42)

**Removing the training dataframe and label column to free memory**

In [None]:
del train_df, y

In [None]:
cols = X_train.columns.tolist()

**Ordinal Encoding**

Encode the categorical features in our training data as an integer array.

The input to ordinal transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

In [None]:
unknown_value = -100
ord_enc = OrdinalEncoder(handle_unknown = 'use_encoded_value', dtype = 'int32', unknown_value = unknown_value)
X_train_encoded = ord_enc.fit_transform(X_train)
X_val_encoded = ord_enc.transform(X_val)

In [None]:
maximum_values = X_train_encoded.max(axis = 0) + 1
X_val_encoded = np.where(X_val_encoded == unknown_value, maximum_values, X_val_encoded)
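
The effect of the two cells above can be seen on a toy column. In this sketch the category `'nvme'` appears only in the validation data, so the encoder first marks it with the placeholder `-100`, and the `np.where` step then moves it to a new code one past the largest code seen in training. The arrays are made-up examples.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

unknown_value = -100

# Made-up single-feature data; 'nvme' never occurs in the training part.
train = np.array([['hdd'], ['ssd'], ['hdd']], dtype=object)
val = np.array([['ssd'], ['nvme']], dtype=object)

enc = OrdinalEncoder(handle_unknown='use_encoded_value',
                     unknown_value=unknown_value, dtype=np.int32)
train_enc = enc.fit_transform(train)   # hdd -> 0, ssd -> 1
val_enc = enc.transform(val)           # ssd -> 1, nvme -> -100

# One code per feature reserved for "unseen in training".
unseen_code = train_enc.max(axis=0) + 1
val_fixed = np.where(val_enc == unknown_value, unseen_code, val_enc)

print(val_enc.tolist())
print(val_fixed.tolist())
```

Mapping unseen categories to a non-negative code keeps every feature within a contiguous integer range.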

**Again, we delete the data that is no longer required to free memory.**

In [None]:
del X_train, X_val

# Modelling

**We fit a Light Gradient Boosting Model on our training dataframe**
* **What is Gradient Boosting?**

Gradient boosting is an iterative algorithm that builds a model in a forward stage-wise fashion. It starts by fitting a simple model, such as a decision tree, to the data and then adds additional models to correct the errors made by the previous models. Each new model is fit to the negative gradient of the loss function with respect to the previous model’s predictions. The final model is a weighted sum of all the individual models.

* **What is the LGBM Model?** 

LightGBM is a gradient boosting ensemble method based on decision trees. As with other decision tree-based methods, LightGBM can be used for both classification and regression. LightGBM is optimized for high performance with distributed systems.

* **Why Light GBM? Why not GBM?**

GBMs are powerful machine learning models that often outperform other types of models, including deep neural networks, on tabular data. LightGBM buckets continuous feature values into histograms and grows trees leaf-wise, which lets it train faster and with less memory than traditional GBM implementations.
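
To make the description of gradient boosting above concrete, here is a minimal from-scratch sketch for squared-error regression, where the negative gradient is simply the residual. It uses scikit-learn's `DecisionTreeRegressor` as the weak learner on synthetic data. It illustrates the idea only and is not how LightGBM is implemented; the data, learning rate and tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))   # synthetic feature
y = np.sin(X[:, 0])                     # synthetic target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken copy of the new tree to the ensemble's prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after 50 trees:  {final_mse:.4f}")
```

For classification, the same loop is run on the gradient of the log-loss instead of the squared error.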




In [None]:
X_sample, _, y_sample, _ = train_test_split(X_train_encoded, y_train, train_size = 0.1, random_state = 42)

In [None]:
parameters = {
              'n_estimators' : [500, 600], 'max_depth' : [7, 8],
              'colsample_bytree' : [0.6, 0.7], 'num_leaves' : [70, 80]
             }

clf = lgb.LGBMClassifier(learning_rate = 0.1)
grid_search_clf = GridSearchCV(clf, parameters, verbose = 2)
grid_search_clf.fit(X_sample, y_sample)

In [None]:
print("These are the best parameters from grid search -")
grid_search_clf.best_params_
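
To get a feel for the cost of this search, the sketch below counts the candidate combinations in the grid with scikit-learn's `ParameterGrid` and multiplies by the 5 folds that `GridSearchCV` uses when `cv` is not given.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'n_estimators': [500, 600], 'max_depth': [7, 8],
    'colsample_bytree': [0.6, 0.7], 'num_leaves': [70, 80],
}

n_candidates = len(ParameterGrid(parameters))   # 2 * 2 * 2 * 2 combinations
n_folds = 5                                     # GridSearchCV default for cv
total_fits = n_candidates * n_folds             # excludes the final refit

print(n_candidates, total_fits)
```

This is why the search is run on a 10% sample rather than on the full training set. Note also that, without a `scoring` argument, `GridSearchCV` ranks classifiers by accuracy; passing `scoring='roc_auc'` would be one way to rank by AUC instead.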

**We take the estimator with the best parameters found by the grid search and refit it on the full training set, which should give us a better score on the test set.**

In [None]:
clf_lgbm = grid_search_clf.best_estimator_

In [None]:
clf_lgbm.fit(X_train_encoded, y_train,
           eval_set = [(X_train_encoded, y_train), (X_val_encoded, y_val)],
           eval_names = ['train', 'val'],
           eval_metric ='auc',
           callbacks = [lgb.log_evaluation(50), lgb.early_stopping(5)])

**Our preferred performance metric is Area Under the ROC Curve (AUC), since this is a binary classification problem. We plot it alongside the model loss, for which we use binary log-loss.**
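
For readers new to the metric, AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the correct order. The toy example below uses four made-up predictions where three of the four such pairs are ordered correctly.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Pairs (positive, negative): (0.35, 0.1) ok, (0.35, 0.4) wrong,
# (0.8, 0.1) ok, (0.8, 0.4) ok  ->  3 of 4 ranked correctly.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```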

In [None]:
plt.figure(figsize=(14, 5))
train_auc = clf_lgbm.evals_result_['train']['auc']
val_auc = clf_lgbm.evals_result_['val']['auc']

plt.subplot(1,2,1)
plt.plot(train_auc,  'bo', label = 'Training AUC')
plt.plot(val_auc,  'r', label = 'Validation AUC')
plt.title('Light Gradient Boosting : Area Under Curve')
plt.legend()


train_loss = clf_lgbm.evals_result_['train']['binary_logloss']
val_loss = clf_lgbm.evals_result_['val']['binary_logloss']
plt.subplot(1,2,2)
plt.plot(train_loss, 'bo', label = 'Training loss')
plt.plot(val_loss, 'r', label = 'Validation loss')
plt.title('Light Gradient Boosting : Binary Log-Loss')
plt.legend()
    
plt.show()

**Again, painstakingly, we get rid of unnecessary data**

In [None]:
del y_train, y_val, X_train_encoded, X_val_encoded

# Testing Phase

We have built the Light Gradient Boosting Model with the best parameters found through GridSearchCV.

Now we use it on the test data to create predictions.

We load the test CSV into the test dataframe using the same `columns_to_ignore` and `column_data_types` definitions.

In [None]:
test_df = pd.read_csv('/kaggle/input/microsoft-malware-prediction/test.csv',
                      usecols = lambda x: x not in columns_to_ignore,
                      dtype = column_data_types)
test_df.head()

In [None]:
submission = test_df.pop('MachineIdentifier').to_frame()

**We apply the same pre-processing function that we defined earlier for the training dataframe**

In [None]:
test_df = categorical_preprocessing(test_df)

In [None]:
test_df[string_columns] = test_df[string_columns].fillna('na')
test_df[not_string_columns] = test_df[not_string_columns].fillna(na_value)
test_df.isna().sum()

**We carry out ordinal transformation as we had performed earlier on the training dataframe**

In [None]:
test_df = ord_enc.transform(test_df)
test_df = np.where(test_df == unknown_value, maximum_values, test_df)

**We create the predictions and write them to a CSV file for submission to the competition**

In [None]:
submission[label_column] = clf_lgbm.predict_proba(test_df)[:, 1]
submission.head()

In [None]:
submission.to_csv('submission.csv', index = False)

**Shoutout to -**

Maryia Znak - https://www.kaggle.com/code/maryiaznak/eda-chi-square-test-logreg?scriptVersionId=129326902

Pavan Sanagapati - https://www.kaggle.com/code/pavansanagapati/14-simple-tips-to-save-ram-memory-for-1-gb-dataset