# <b><u>MICROSOFT MALWARE PREDICTION</u></b>

The goal of this exercise is to estimate the probability that a machine running the Windows operating system becomes infected by some type of malware, based on the machine's various properties.  

Develop a Notebook with our proposed model to solve the problem. The Notebook must contain all the stages of the ML Checklist, properly commented (clarity will be valued), and run without issues to produce the resulting model.  

Specifically, it must cover data exploration (interesting visualizations will be valued), preprocessing, modeling with a Decision Tree (optionally exploring other algorithms), and evaluation.

## Importing the required libraries.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing

plt.style.use('ggplot')

## Loading the DataFrame.

In [2]:
file_path = '/Users/orlando/Documents/02-Entregable_ms_malware_prediction/sample_mmp.csv'

df_mmp = pd.read_csv(file_path, low_memory=False)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
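The `Unnamed: 0` column that shows up in the outputs of this notebook is what `pd.read_csv` produces when the first header cell of a CSV is blank, which typically happens when a DataFrame was saved together with its index. A minimal sketch, assuming that is the case here and using a tiny in-memory CSV (the rows and the names `csv_text`, `df_default` and `df_indexed` are illustrative, not taken from `sample_mmp.csv`), of how `index_col=0` turns that column back into the index:

```python
import io

import pandas as pd

# Hypothetical CSV whose first header cell is blank,
# as happens when a DataFrame is saved together with its index.
csv_text = (
    ",MachineIdentifier,HasDetections\n"
    "8427007,aaa,0\n"
    "8829090,bbb,1\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())  # ['Unnamed: 0', 'MachineIdentifier', 'HasDetections']
print(df_indexed.columns.tolist())  # ['MachineIdentifier', 'HasDetections']
```

Whether to keep the column as an index or drop it is a modeling decision; either way it should probably not be used as a feature.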

## Univariate data analysis.

### DataFrame size.

In [3]:
df_mmp.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Columns: 84 entries, Unnamed: 0 to HasDetections
dtypes: float64(36), int64(18), object(30)
memory usage: 320.4+ MB
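The `+` in `320.4+ MB` marks a lower bound: for `object` columns pandas counts only the references unless it is asked to inspect the underlying Python objects. A minimal sketch on a small made-up frame (the name `df_small` and its values are illustrative) comparing the shallow and deep estimates:

```python
import pandas as pd

# Small illustrative frame; object dtype is forced to mirror the notebook's string columns.
df_small = pd.DataFrame({
    "ProductName": pd.Series(["win8defender"] * 1000, dtype=object),
    "IsBeta": [0] * 1000,
})

# Shallow estimate: object columns contribute only their references.
shallow_bytes = df_small.memory_usage().sum()

# Deep estimate: the size of each string object is added as well.
deep_bytes = df_small.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

On the full dataset, `df_mmp.info(memory_usage='deep')` would report the deep figure directly, at the cost of a slower call.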


### Data preview.

In [4]:
df_mmp.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| Unnamed: 0 | 8427007 | 8829090 | 2731904 |
| MachineIdentifier | f1cd864e97bae82bdf96523e1a539121 | fd5ba6f5b75325ec0423a6c67cc75942 | 4e628391e7cc7c482fb3286f486dbd25 |
| ProductName | win8defender | win8defender | win8defender |
| EngineVersion | 1.1.15100.1 | 1.1.15100.1 | 1.1.15100.1 |
| AppVersion | 4.18.1807.18075 | 4.18.1807.18075 | 4.9.10586.1106 |
| AvSigVersion | 1.273.1234.0 | 1.273.1282.0 | 1.273.781.0 |
| IsBeta | 0 | 0 | 0 |
| RtpStateBitfield | 7.0 | 7.0 | 7.0 |
| IsSxsPassiveMode | 0 | 0 | 0 |
| DefaultBrowsersIdentifier | NaN | NaN | NaN |


- <b>Unnamed: 0: </b>index.  
- <b>MachineIdentifier: </b>Individual machine ID.  
- <b>ProductName: </b>Defender state information e.g. win8defender.  
- <b>EngineVersion: </b>Defender state information e.g. 1.1.12603.0.  
- <b>AppVersion: </b>Defender state information e.g. 4.9.10586.0.  
- <b>AvSigVersion: </b>Defender state information e.g. 1.217.1014.0.  
- <b>IsBeta: </b>Defender state information e.g. false.  
- <b>RtpStateBitfield: </b>RTP state: Realtime protection state (Enabled or Disabled).  
- <b>IsSxsPassiveMode: </b>Active/passive mode of operation for Windows Defender. If another third-party primary antivirus exists on the system, Defender enters passive mode, which offers reduced functionality.  
- <b>DefaultBrowsersIdentifier: </b>ID for the machine's default browser.  
- <b>AVProductStatesIdentifier: </b>ID for the specific configuration of a user's antivirus software.  
- <b>AVProductsInstalled: </b>Number of antivirus products installed on the machine.  
- <b>AVProductsEnabled: </b>Number of installed antivirus products that are enabled.  
- <b>HasTpm: </b>True if the machine has a TPM.  
- <b>CountryIdentifier: </b>ID for the country the machine is located in.  
- <b>CityIdentifier: </b>ID for the city the machine is located in.  
- <b>OrganizationIdentifier: </b>ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries.  
- <b>GeoNameIdentifier: </b>ID for the geographic region a machine is located in.  
- <b>LocaleEnglishNameIdentifier: </b>English name of Locale ID of the current user.  
- <b>Platform: </b>Calculates platform name (of OS related properties and processor property).  
- <b>Processor: </b>This is the process architecture of the installed operating system.  
- <b>OsVer: </b>Version of the current operating system.  
- <b>OsBuild: </b>Build of the current operating system.  
- <b>OsSuite: </b>Product suite mask for the current operating system.  
- <b>OsPlatformSubRelease: </b>Returns the OS Platform sub.  
- <b>OsBuildLab: </b>Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109.  
- <b>SkuEdition: </b>The goal of this feature is to use the Product Type defined in the MSDN (Microsoft Developer Network) to map to a SKU (Stock Keeping Unit).  
- <b>IsProtected: </b>This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up.  
- <b>AutoSampleOptIn: </b>This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+.  
- <b>PuaMode: </b>Pua Enabled mode from the service.  
- <b>SMode: </b>This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed.  
- <b>IeVerIdentifier: </b>Retrieves which version of Internet Explorer is running on this device.  
- <b>SmartScreen: </b>This is the SmartScreen enabled string value from the registry. It is obtained by checking, in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.  
- <b>Firewall: </b>This attribute is true (1) for Windows 8.1 and above if Windows Firewall is enabled, as reported by the service.  
- <b>UacLuaenable: </b>This attribute reports whether the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.  
- <b>Census_MDC2FormFactor: </b>A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...).  
- <b>Census_DeviceFamily: </b>AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone.  
- <b>Census_OEMNameIdentifier: </b>NA.  
- <b>Census_OEMModelIdentifier: </b>NA.  
- <b>Census_ProcessorCoreCount: </b>Number of logical cores in the processor.  
- <b>Census_ProcessorManufacturerIdentifier: </b>NA.  
- <b>Census_ProcessorModelIdentifier: </b>NA.  
- <b>Census_ProcessorClass: </b>A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated.  
- <b>Census_PrimaryDiskTotalCapacity: </b>Amount of disk space on primary disk of the machine in MB.  
- <b>Census_PrimaryDiskTypeName: </b>Friendly name of the primary disk type.  
- <b>Census_SystemVolumeTotalCapacity: </b>The size of the partition that the System volume is installed on in MB.  
- <b>Census_HasOpticalDiskDrive: </b>True indicates that the machine has an optical disk drive (CD/DVD).  
- <b>Census_TotalPhysicalRAM: </b>Retrieves the physical RAM in MB.  
- <b>Census_ChassisTypeName: </b>Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx.  
- <b>Census_InternalPrimaryDiagonalDisplaySizeInInches: </b>Retrieves the physical diagonal length in inches of the primary display.  
- <b>Census_InternalPrimaryDisplayResolutionHorizontal: </b>Retrieves the number of pixels in the horizontal direction of the internal display.  
- <b>Census_InternalPrimaryDisplayResolutionVertical: </b>Retrieves the number of pixels in the vertical direction of the internal display.  
- <b>Census_PowerPlatformRoleName: </b>Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device.  
- <b>Census_InternalBatteryType: </b>NA.  
- <b>Census_InternalBatteryNumberOfCharges: </b>NA.  
- <b>Census_OSVersion: </b>Numeric OS version Example .  
- <b>Census_OSArchitecture: </b>Architecture on which the OS is based. Derived from OSVersionFull. Example .  
- <b>Census_OSBranch: </b>Branch of the OS extracted from the OsVersionFull. Example .  
- <b>Census_OSBuildNumber: </b>OS Build number extracted from the OsVersionFull. Example .  
- <b>Census_OSBuildRevision: </b>OS Build revision extracted from the OsVersionFull. Example .  
- <b>Census_OSEdition: </b>Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise.  
- <b>Census_OSSkuName: </b>OS edition friendly name (currently Windows only).  
- <b>Census_OSInstallTypeName: </b>Friendly description of what install was used on the machine i.e. clean.  
- <b>Census_OSInstallLanguageIdentifier: </b>NA.  
- <b>Census_OSUILocaleIdentifier: </b>NA.  
- <b>Census_OSWUAutoUpdateOptionsName: </b>Friendly name of the WindowsUpdate auto.  
- <b>Census_IsPortableOperatingSystem: </b>Indicates whether OS is booted up and running via Windows.  
- <b>Census_GenuineStateName: </b>Friendly name of OSGenuineStateID. 0 = Genuine.  
- <b>Census_ActivationChannel: </b>Retail license key or Volume license key for a machine.  
- <b>Census_IsFlightingInternal: </b>'Flighting' in the Windows Defender context means making new development features available as soon as possible during the development cycle. This does not refer to a public release. The 'internal' most likely refers to the Windows Insider community.  
- <b>Census_IsFlightsDisabled: </b>Indicates if the machine is participating in flighting.  
- <b>Census_FlightRing: </b>The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.  
- <b>Census_ThresholdOptIn: </b>NA.  
- <b>Census_FirmwareManufacturerIdentifier: </b>NA.  
- <b>Census_FirmwareVersionIdentifier: </b>NA.  
- <b>Census_IsSecureBootEnabled: </b>Indicates if Secure Boot mode is enabled. Secure Boot is a security measure to protect against malware during early system startup.  
- <b>Census_IsWIMBootEnabled: </b>wimboot is a boot loader for Windows Imaging Format .wim files. It enables you to boot into a Windows PE (WinPE) deployment or recovery environment.  
- <b>Census_IsVirtualDevice: </b>Identifies a Virtual Machine (machine learning model).  
- <b>Census_IsTouchEnabled: </b>Is this a touch device?  
- <b>Census_IsPenCapable: </b>Is the device capable of pen input?  
- <b>Census_IsAlwaysOnAlwaysConnectedCapable: </b>Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.  
- <b>Wdft_IsGamer: </b>Indicates whether the device is a gamer device or not based on its hardware combination.  
- <b>Wdft_RegionIdentifier: </b>Region id code.  
- <b>HasDetections: </b>Indicates that malware was detected on the machine.  

In [5]:
df_mmp.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 500000.0 | 4458888.0 | 2575619.0 | 2.0 | 2227692.5 | 4461367.5 | 6690936.0 | 8921471.0 |
| IsBeta | 500000.0 | 2e-06 | 0.001414214 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| RtpStateBitfield | 498168.0 | 6.846207 | 1.023049 | 0.0 | 7.0 | 7.0 | 7.0 | 35.0 |
| IsSxsPassiveMode | 500000.0 | 0.017242 | 0.130172 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DefaultBrowsersIdentifier | 24061.0 | 1652.825 | 1004.754 | 1.0 | 788.0 | 1632.0 | 2381.0 | 3209.0 |
| AVProductStatesIdentifier | 498062.0 | 47850.91 | 14023.09 | 3.0 | 49480.0 | 53447.0 | 53447.0 | 70492.0 |
| AVProductsInstalled | 498062.0 | 1.326763 | 0.5229999 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| AVProductsEnabled | 498062.0 | 1.020714 | 0.166608 | 0.0 | 1.0 | 1.0 | 1.0 | 4.0 |
| HasTpm | 500000.0 | 0.987816 | 0.1097068 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| CountryIdentifier | 500000.0 | 108.0375 | 63.06854 | 1.0 | 51.0 | 97.0 | 162.0 | 222.0 |
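`describe()` summarizes only numeric columns by default, so the 30 `object` columns do not appear in the table above. A minimal sketch, on a small made-up frame (the rows and the name `df_small` are illustrative), of how the non-numeric columns could be summarized as well:

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real data comes from sample_mmp.csv.
df_small = pd.DataFrame({
    "ProductName": ["win8defender", "win8defender", "mse"],
    "IsBeta": [0, 0, 0],
})

# Excluding numeric dtypes yields count / unique / top / freq per column.
categorical_summary = df_small.describe(exclude=[np.number])
print(categorical_summary)
```

For categorical attributes, `unique` and `freq` give a quick sense of cardinality and of how dominant the most common value is.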


### Available attribute types.

In [6]:
df_mmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 84 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   Unnamed: 0                                         500000 non-null  int64  
 1   MachineIdentifier                                  500000 non-null  object 
 2   ProductName                                        500000 non-null  object 
 3   EngineVersion                                      500000 non-null  object 
 4   AppVersion                                         500000 non-null  object 
 5   AvSigVersion                                       500000 non-null  object 
 6   IsBeta                                             500000 non-null  int64  
 7   RtpStateBitfield                                   498168 non-null  float64
 8   IsSxsPassiveMode                                   500000 non-null  int64 
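
To handle numeric and non-numeric attributes separately in later steps, `select_dtypes` can split the columns by type. A minimal sketch on an illustrative frame (the values and the names `df_small`, `numeric_cols` and `other_cols` are made up for the example):

```python
import pandas as pd

# Illustrative frame with one text column and two numeric ones.
df_small = pd.DataFrame({
    "MachineIdentifier": ["aaa", "bbb"],
    "IsBeta": [0, 0],
    "RtpStateBitfield": [7.0, None],
})

# Columns with integer or float dtypes.
numeric_cols = df_small.select_dtypes(include="number").columns.tolist()

# Everything else (text columns in this example).
other_cols = df_small.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['IsBeta', 'RtpStateBitfield']
print(other_cols)    # ['MachineIdentifier']
```

Note that a numeric dtype does not imply a numeric meaning: identifier columns such as `CountryIdentifier` are stored as numbers but behave as categories.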

### Descriptive statistics.

In [7]:
df_mmp.isna().sum()

Unnamed: 0                                                0
MachineIdentifier                                         0
ProductName                                               0
EngineVersion                                             0
AppVersion                                                0
AvSigVersion                                              0
IsBeta                                                    0
RtpStateBitfield                                       1832
IsSxsPassiveMode                                          0
DefaultBrowsersIdentifier                            475939
AVProductStatesIdentifier                              1938
AVProductsInstalled                                    1938
AVProductsEnabled                                      1938
HasTpm                                                    0
CountryIdentifier                                         0
CityIdentifier                                        18240
OrganizationIdentifier                  
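
Raw counts are easier to compare as a share of rows; for instance, the 475,939 missing values in `DefaultBrowsersIdentifier` amount to about 95% of the 500,000 rows. A minimal sketch, on a small made-up frame (the values and the names `df_small` and `missing_pct` are illustrative), of ranking columns by percentage of missing values:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one mostly-missing column, one with a single gap, one complete.
df_small = pd.DataFrame({
    "DefaultBrowsersIdentifier": [np.nan, np.nan, np.nan, 788.0],
    "RtpStateBitfield": [7.0, 7.0, np.nan, 7.0],
    "IsBeta": [0, 0, 0, 0],
})

# isna().mean() is the fraction of missing values per column.
missing_pct = df_small.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the candidates for dropping or imputation at the top.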

In [9]:
df_mmp['Wdft_RegionIdentifier'].value_counts(normalize=True)*100

Wdft_RegionIdentifier
10.0    20.782942
11.0    15.653038
3.0     15.126591
1.0     14.220474
15.0    11.871649
7.0      6.906531
8.0      3.262809
13.0     2.625194
5.0      2.373667
12.0     1.886968
6.0      1.811407
4.0      1.570438
9.0      0.935928
2.0      0.925370
14.0     0.046993
Name: proportion, dtype: float64
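
`value_counts` ignores missing values by default, so the percentages above are relative to the non-null rows only. A minimal sketch, using a short made-up Series (the values and variable names are illustrative), of how `dropna=False` includes missing values in the distribution:

```python
import numpy as np
import pandas as pd

# Four illustrative rows, one of them missing.
region = pd.Series([10.0, 10.0, 11.0, np.nan], name="Wdft_RegionIdentifier")

# Percentages over non-null rows only (the default behaviour).
pct_non_null = region.value_counts(normalize=True) * 100

# Percentages over all rows, with NaN counted as its own category.
pct_all_rows = region.value_counts(normalize=True, dropna=False) * 100

print(pct_non_null)
print(pct_all_rows)
```

In this example the share of region `10.0` drops from two thirds of the non-null rows to half of all rows once the missing value is counted.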