# Android malware analysis

This notebook represents AI4 individual challenge, it provides insight into my work in the area of cybersecurity, and how I use data science to train machine learning algorithms to detect malware.

# Background

Android is the most popular OS used on mobile systems[1] on the planet having a 75% of mobile OS share. Moreover,  it is the most popular OS worldwide, beating even Microsoft Windows and macOS. Several reasons contribute to its popularity, first and foremost it is open-sourced which contributed greatly to its adoption in the market. However, due to its core design philosophy, it also allows users to install third-party applications without central control, thus becoming a malware target. Even though it includes security mechanisms, cybersecurity is a never-ending race between hackers and cybersecurity specialists, making the development of frameworks and methods to improve its security.

# Analysis

Two approaches can be used to analyze cyber threads, static analysis, and dynamic analysis. 

Static analysis: allows to get information about software(predict) without executing it, this can include studying the code, calls, and access to resources.

Dynamic analysis: another approach to analysis and prediction of cyber threats is to get information while they are executed, and gather information about their behavior ex. network traffic. 

In this project, the static analysis will be used to detect malware.

# Dataset

The chosen dataset is Drebin, it is a dataset that consists of malware samples taken between 2012 and 2015. This set has a size of 15036 applications that are grouped into 49 families. It contains two features sets: API calls and permission manifest.

Importing necessary libraries for this project

In [189]:
import numpy as np 
import pandas as pd 
import time
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from numpy import array
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Lasso
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
import eli5
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

Reading dataset into the Jupyter notebook

In [50]:
df_malware = pd.read_csv("C:\\Users\\Marcin\\Desktop\\fontys\\4th semester\\Individual Challenge\challenge 1\Chosen Dataset\\drebin\\challenge dataset.csv",
                        low_memory = False)
df_malware.head(5)

Unnamed: 0,transact,onServiceConnected,bindService,attachInterface,ServiceConnection,android.os.Binder,SEND_SMS,Ljava.lang.Class.getCanonicalName,Ljava.lang.Class.getMethods,Ljava.lang.Class.cast,...,READ_CONTACTS,DEVICE_POWER,HARDWARE_TEST,ACCESS_WIFI_STATE,WRITE_EXTERNAL_STORAGE,ACCESS_FINE_LOCATION,SET_WALLPAPER_HINTS,SET_PREFERRED_APPLICATIONS,WRITE_SECURE_SETTINGS,CLASS
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,S
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,S
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,S
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,1,1,0,0,0,S
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,S


Display of dataset information: number of columns(features) and rows, as well as the type of data.

In [3]:
df_malware.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15036 entries, 0 to 15035
Columns: 216 entries, transact to CLASS
dtypes: int64(214), object(2)
memory usage: 24.8+ MB


This dataset contains 15036 entries(15036 tested applications), it has 216 columns and 215 features. Moreover, 215 data types are integers and one object: class. Class defines whether the tested sample was malware or not (S=Malware, B=Benign).

Display of target variable balance

In [4]:
print(df_malware['CLASS'].value_counts())

B    9476
S    5560
Name: CLASS, dtype: int64


Dataset contains 9476 benign samples and 5560 malware samples. 36.9% of dataset consists of malware samples.

# Feature selection

During the initial stages of the project, the 215 features were deemed to be too much for prediction(possibility of overfitting). Hence, the KBest, as well as Lasso algorithms, were used to find out the best statistical features to be used with ML algorithm. However, the literature research and University of Gottingen[1] research pointed out that malware analysis can be done with large feature sets, well beyond 600 featues[1]. Thus, all features of the dataset will be taken into the account in this project. 

Note: The feature selection has been moved to the bottom of the notebook.

# Check for null values

To feed this data into ML algorithm, I'm making sure that this dataset does not contain any null or missing values that could either skew the results or do not work at all. The code below will check if there are any of them in the dataset.

In [5]:
df_malware.isnull().values.sum()

0

The number of missing values is 0, which means that this dataset does not contain any of them. This allows me to start applying ML algorithm to the dataset.

# Dataset Transformation

In the section below I'm transforming the dataset to eliminate strings inside it. Replacement of CLASS values to binary(1 - Malware, 0 - Benign) and changing the type to int64. 

In [51]:
df_malware['CLASS'] = df_malware['CLASS'].replace(
    to_replace=['S'], 
    value='1')
df_malware['CLASS'] = df_malware['CLASS'].replace(
    to_replace=['B'], 
    value='0')
df_malware['CLASS'] = df_malware['CLASS'].astype('int64')

# Data Preparation

In the section below I'm selecting my target variable(CLASS).

In [52]:
#selection of target
y = df_malware.iloc[:,-1] #target variable
X = df_malware.drop(['CLASS'], axis = 1) #dataset

After running the selectKbest several times, each time python method has executed with an error, unknown sign "?". After close inspection of the dataset I was unable to locate the source of question mark. My search included: searching through dataset with pandas build in functions did not yield any results.

In [10]:
df_malware[df_malware.eq("?").any(1)]

Unnamed: 0,transact,bindService,onServiceConnected,ServiceConnection,android.os.Binder,READ_SMS,attachInterface,WRITE_SMS,TelephonyManager.getSubscriberId,Ljava.lang.Class.getCanonicalName,...,Ljava.lang.Object.getClass,SET_ORIENTATION,DEVICE_POWER,EXPAND_STATUS_BAR,GET_TASKS,GLOBAL_SEARCH,GET_PACKAGE_SIZE,SET_PREFERRED_APPLICATIONS,android.intent.action.PACKAGE_CHANGED,CLASS


However, the search function of Microsoft Excel can find them inside the dataset. To remove them, I'm converting both the target variable and the dataset to the numeric type. Moreover, I'm replacing the NaN values(which do not exist - see above) with 0. The combination of both methods removes the "?" from the dataset. As of now, I wasn't able to pinpoint why this happens(maybe formatting/import error).

In [53]:
#removal of "?" sign - throws error without it
y=y.apply(pd.to_numeric, errors = 'coerce')
X=X.apply(pd.to_numeric, errors = 'coerce')

X.fillna(0, inplace=True)
y.fillna(0, inplace=True)

The next cell represents the splitting of the dataset into train and test, random_state is set to a specific number to ensure that each time it results in the same split.

In [54]:
#splitting data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling

In [136]:
#sc = StandardScaler()
#sc.fit(X_train)
#X_train_std = sc.transform(X_train)
#X_test_std = sc.transform(X_test)

# Prediction classifier

After consulting the lecturers and litrature review, the best suited classifier for large feature sets is SVM[2]. SVM determines the hyperplane that seperates both classes with maximum margin, and given the vectors(2 classes), classification is based on where the application could lands.  

Hyperplane[2] <img src="Hyperplane.jpg" width=400 height=400 />

First iteration of SVM with deafult values, that will be compared to parameter tuned one.

In [10]:
svm_malware = SVC()
svm_malware.fit(X_train, y_train)
pred = svm_malware.predict(X_test)
acc = metrics.accuracy_score(pred, y_test)*100
print("Accuracy without tuned hyperparameters is : {:.2f}%".format(acc))

Accuracy without tuned hyperparameters is : 98.11%


Due to high initial accuracy, it is conceivable that SVM model is overfitting the data.

# Hyper-parameter tuning

The machine learning model is defined as a mathematical model with parameters. Those parameters have to be learned from the available data. ML models also use parameters called "hyper-parameters", they are responsible for better "fitting" of the model to the data(improving complexity, learning rate). Often chosen by trial and error. Models can have multiple hyper-parameters that influence their behavior. SVM alone has 15 hyper-parameters that can changed. Although, not all at once, as some of them are only specific to certain kernels. 

To avoid manual tuning, python has build-in tools to help with hyper-parameter tuning. The best solution to this is to create a parameter grid, with all possible combinations. Moreover, it is recommened that hyper-parameter tuning done with parameters spaced as much as possible[1]( ex. C value from 1 to 1000). 

Sklearn implementation of this approach is SearchGridCV, which takes dictionary of parameters and use brute force method by trying every possible combination. Brute force methods are by design computation intensive. 

Test parameters will be used ilustrate computation time.

In [17]:
param_grid_test = {'C': [1, 10, 100],
                   'gamma': [1, 0.1, 0.01, 0.001], 
                   'kernel':['rbf','linear','poly']
                  }

GridSearchCV is a brute force estimator, this means that all possible combinations of hyper-parameters will be used to calculate the best hyper-parameters. In the example below, cross-validation is set to 4, meaning that the model will have 400 interations. This is computing-intensive and in larger models will take a significant amount of time[3].

One way to solve to computational time issue is to use RandomizedSearch that controls the number of iterations(n_iter parameter), however this method by design limit the number of iterations on the model, and will influence the result(in a negative way), as it will stop iterating(searching for) the best combination of hyper-parameters after a specified number of iterations.

Example below demonstarates the GridSearchCV on a small set of hyper-parameters.

In [14]:
%%time
grid = GridSearchCV(svm_malware, param_grid_test, refit = True, verbose = 0 ,cv=4)
  
#fitting the model for grid search
grid.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ",grid.best_params_) 

print("accuracy :",grid.best_score_*100)

tuned hyperparameters :(best parameters)  {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
accuracy : 98.98570003325574
CPU times: total: 29min 1s
Wall time: 29min 10s


In 2020 Sklearn added to their library, HalvingGridSearch which takes a different approach to the calculation of best hyper-parameters - successive halving. Like GridSearch it uses iterations to determine the best hyper-parameters however, all candidates are only trained on the first iteration(with limited resources) and only the ones with the best score are moved to another(and granted more resources each next iteration). Thus, reducing computational time drastically by eliminating candidates at the initial stage rather than iterating through them even though they produce lower scores.

The speed of the process can be influenced by two parameters: min_resources which specifies how much data is being fed to it at the first iteration and factor - by how more data is fed to it in the successive iterations. However, due to this approach, it is possible to set min_resources and factor in such a way that the available resources will run out before all iterations are completed, to avoid that min_resources can be set to "exhaust" to automatically assign resources to iterations so that all of them will pass. The factor is set to 3, which is the recommendation of Sklearn(default, most reliable, and compatible). 

In [15]:
%%time
halved_test = HalvingGridSearchCV(svm_malware, param_grid_test,n_jobs=-1, 
                                  min_resources="exhaust", factor=3, cv = 5)

halved_test.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ",halved_test.best_params_) 

print("accuracy :",halved_test.best_score_*100)

tuned hyperparameters :(best parameters)  {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
accuracy : 99.02606476847045
CPU times: total: 11.1 s
Wall time: 1min 56s


After running the test samples, we can see that HalvingGrid search is much faster than brute force method of GridSearch. To reduce computation time this project will use HalvingGridSearch to fine tune the hyper parameters. 

# HalvingGridSearchCV - hyperparameter tuning for malware detection

Parameters grid for malware prediction. A chosen array of numbers represent as wide as possible a hyper-parameter set(computationally).

C - is the cost of misclassification of data. 
High C will produce low bias and high variance - because the cost of misclassification increases.
Low C will produce high bias and low variance - as the decision barrier is lowered(misclassification cost is lowered).

Gamma(rbf kernel) influences the distance of training of the single point. Meaning low gamma values will produce high bais as the distance(similarity radius) for classification is broadened. High gamma values will result in low bais as the distance of classification is reduced(points on the hyperplane need to be close together to be classified as the same category). This hyper-parameter has to be positive.

In order to find the "sweet spot" for the malware prediction code below will test a wide array of parameters, spaced as much as possible while retaining reasonable computation time[3]. 

# Kernel choice

First choosing the kernel as hyper-tuning of parameters will be depended on it.

In [19]:
param_grid_kernel = {'kernel':['linear', 'poly', 'sigmoid','rbf']}

In [20]:
%%time
halved_kernel = HalvingGridSearchCV(svm_malware, param_grid_kernel,n_jobs=-1, min_resources="exhaust", factor=3, cv = 5)

halved_kernel.fit(X_train, y_train)

print("tuned kernel : ",halved_kernel.best_params_) 

print("accuracy :",halved_kernel.best_score_*100)

tuned kernel :  {'kernel': 'rbf'}
accuracy : 98.22012515523332
CPU times: total: 5.81 s
Wall time: 1min 20s


The chosen kernel is rbf, thus all of the following tuning will be set-up for rbf.

# Hyper-parameters tuning(rbf kernel)

Originally planned parameters for grid search, however computation time proved to too significant, moreover at some point system has ran out of memory and crashed.

In [18]:
svm_C = [0]
for i in np.arange(0.1,1000,0.1):
    svm_C.append(i)
    
svm_gamma = [0]
for i in np.arange(0.001,100,0.001):
    svm_gamma.append(i)
svm_gamma.append('gamma')

Smaller parameter tuning range  

In [21]:
# defining parameter range
param_grid_hyper = {'C': [0.001,0.01,0.1,1,10,100,1000], 
              'gamma': [0.001, 0.01,0.1,1,10,100], 'kernel':['rbf']}

In [22]:
%%time
halved_hyper = HalvingGridSearchCV(svm_malware, param_grid_hyper,n_jobs=-1, min_resources="exhaust", factor=3, cv = 5)

halved_hyper.fit(X_train, y_train)

print("tuned kernel :(best parameters) ",halved_hyper.best_params_) 

print("accuracy :",halved_hyper.best_score_*100)

tuned kernel :(best parameters)  {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
accuracy : 99.02606476847045
CPU times: total: 11.1 s
Wall time: 2min 59s


# Model prediction with tuned hyper-parameters

In [23]:
svm_malware_tuned = SVC(C = 10, kernel = 'rbf',gamma = 0.1)
svm_malware_tuned.fit(X_train, y_train)
pred_tuned = svm_malware_tuned.predict(X_test)
acc_tuned = metrics.accuracy_score(pred_tuned, y_test)*100
print("Accuracy with tuned hyperparamters is : {:.2f}%".format(acc_tuned))

Accuracy with tuned hyperparamters is : 98.84%


The accuracy of prediction has increased by 0,73% after tuning the hyper-parametrs. It might be due to the small sample size, as only 2 sets of features are being taken into account. Research[2] presented by Daniel Arp, Michael Spreitzenbarth , Malte Hubner, Hugo Gascon, and Konrad Rieck, suggests that malware research is best done with large sample sizes as in thier reaserch they used over 131,000 samples which were divivded into 7 feature sets.

# Cross validation

The possibility of overfitting the model is a concern while training the model on training data. Therefore to avoid overfitting I'm applying the 10-fold cross-validation. 

In [22]:
cv = KFold(n_splits=10, random_state=1, shuffle=True)

#use k-fold CV to evaluate model
scores = cross_val_score(svm_malware_tuned, X, y, cv=cv, n_jobs=-1)

print("Accuracy scores: " + np.array_str(scores))
print("Avg accuracy: {}".format(scores.mean()))

Accuracy scores: [0.9900266  0.98803191 0.99135638 0.99268617 0.99135638 0.99135638
 0.99201597 0.99001996 0.99001996 0.99001996]
Avg accuracy: 0.9906889678090627


Accuracy scores closely match the results after the hyper-parameter tuning. Meaning that the model is most likely not overfitted.

# Evaluation

In [23]:
print(metrics.classification_report(y_test, pred_tuned))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1863
           1       0.99      0.98      0.99      1145

    accuracy                           0.99      3008
   macro avg       0.99      0.99      0.99      3008
weighted avg       0.99      0.99      0.99      3008



The classification report contains 2 classes: 1 and 0. As mentioned before 1 is malware and 0 is benign. Both of them exhibit high precision and recall. Meaning that model correctly indentifies that given sample is a malware, moreover class 1, has slighlty lower recall(0,01%), which means that model might sometimes identify benign application as malware. Considering potential dangers that come with malware, high precision is more important than recall. The possibily blocking benign application is preferable to not-blocking actual malware.

# Accuracy comperasion between training and test datasets


In [24]:
print(f'Training Accuracy: {svm_malware_tuned.score(X_train,y_train)*100}%')
print(f'Model Accuracy: {svm_malware_tuned.score(X_test,y_test)*100}%')

Training Accuracy: 99.90023279015631%
Model Accuracy: 98.86968085106383%


High training accuracy can be the result of duplicated rows, as dataset is not imbalanced.

# Dataset copy

In [55]:
df_malware_duplicates=df_malware.copy(deep = True)

# Checking for duplicates

In [56]:
df_malware_duplicates.duplicated(subset= None, keep='first').sum()

7775

There are 7775 duplicate rows in this dataset.

# Removing the duplicates

In [57]:
df_malware_duplicates.drop_duplicates(subset=None, keep='first', inplace=True)
df_malware_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7261 entries, 0 to 15033
Columns: 216 entries, transact to CLASS
dtypes: int64(215), object(1)
memory usage: 12.0+ MB


Choosing target variable from dataset with removed duplicates

In [58]:
y_dup = df_malware_duplicates.iloc[:,-1] #target variable
X_dup = df_malware_duplicates.drop(['CLASS'], axis = 1) #dataset

#removal of "?" sign - throws error without it
y_dup=y.apply(pd.to_numeric, errors = 'coerce')
X_dup=X.apply(pd.to_numeric, errors = 'coerce')

X_dup.fillna(0, inplace=True)
y_dup.fillna(0, inplace=True)

Splitting the dataset into train and test

In [63]:
X_train_dup, X_test_dup, y_train_dup, y_test_dup = train_test_split(X_dup, y_dup, test_size=0.2, random_state=42)

SVM prediction with tuned hyperparamets and removed duplicates

In [60]:
svm_malware_tuned = SVC(C = 10, kernel = 'rbf',gamma = 0.1)
svm_malware_tuned.fit(X_train_dup, y_train_dup)
pred_tuned = svm_malware_tuned.predict(X_test_dup)
acc_tuned = metrics.accuracy_score(pred_tuned, y_test_dup)*100
print("Accuracy with tuned hyperparamters and removed duplicates is : {:.2f}%".format(acc_tuned))

Accuracy with tuned hyperparamters and removed duplicates is : 98.84%


The removal of duplicates has no influence on accuracy, which means that in previous iterations model was fitted correctly.

# Feature selection - manual

In [18]:
feature_selection = {}

In [19]:
%%time
feature_selection.clear()
for x in range(215):
    df = X.copy()
    column_name = df.columns[x]
    df.drop(df.columns[x], axis=1, inplace=True)
    df_train, df_test, y_train, y_test = train_test_split(df,y,test_size=0.2)
    svm_malware_tuned.fit(df_train,y_train)
    pred = svm_malware_tuned.predict(df_test)
    acc = metrics.accuracy_score(pred, y_test)
    feature_selection[column_name] = (100-round(acc,2))

CPU times: total: 41min 45s
Wall time: 42min 9s


The loop above test absence of each feature on accuracy results. Each removed feature is juxtaposed with accuracy penalty after it's removal.

#### Feature sorting and comperasion with KBest and Lasso

In [39]:
feature_selection_sorted=dict(sorted(feature_selection.items(),key= lambda x:x[1]))
for x in list(list(feature_selection_sorted))[0:20]:
    print (x + ":" + str(feature_selection_sorted[x]))

transact:99.01
onServiceConnected:99.01
bindService:99.01
attachInterface:99.01
ServiceConnection:99.01
android.os.Binder:99.01
SEND_SMS:99.01
Ljava.lang.Class.getCanonicalName:99.01
Ljava.lang.Class.getMethods:99.01
Ljava.lang.Class.cast:99.01
Ljava.net.URLDecoder:99.01
android.content.pm.Signature:99.01
android.telephony.SmsManager:99.01
READ_PHONE_STATE:99.01
getBinder:99.01
ClassLoader:99.01
Landroid.content.Context.registerReceiver:99.01
Ljava.lang.Class.getField:99.01
Landroid.content.Context.unregisterReceiver:99.01
GET_ACCOUNTS:99.01


Listed values(first 20 out of all features) all have 99.01% accuracy(in fact all features exhibit the same accuracy), meaning that removal of any feature does not influence prediction after hyper-parameter tuning. 

# Feature selection using KBest

In [215]:
kBest = SelectKBest(score_func=chi2, k=32)
z = kBest.fit(X_dup,y_dup)
selectedFeatures = kBest.transform(X_dup)
mask = kBest.get_support(1)
Selected_Features=X_dup[X_dup.columns[mask]]
print(selectedFeatures.shape)

(15036, 32)


#### Display of selected features

In [216]:
for col_name in Selected_Features.columns: 
    print(col_name)

transact
onServiceConnected
bindService
attachInterface
ServiceConnection
android.os.Binder
SEND_SMS
Ljava.lang.Class.getCanonicalName
Ljava.lang.Class.getMethods
Ljava.lang.Class.cast
Ljava.net.URLDecoder
android.content.pm.Signature
android.telephony.SmsManager
READ_PHONE_STATE
getBinder
ClassLoader
Landroid.content.Context.registerReceiver
Ljava.lang.Class.getField
Landroid.content.Context.unregisterReceiver
GET_ACCOUNTS
RECEIVE_SMS
Ljava.lang.Class.getDeclaredField
READ_SMS
getCallingUid
Ljavax.crypto.spec.SecretKeySpec
android.intent.action.BOOT_COMPLETED
USE_CREDENTIALS
MANAGE_ACCOUNTS
TelephonyManager.getLine1Number
DexClassLoader
WRITE_SMS
android.telephony.gsm.SmsManager


#### Prediction on kBest features

In [217]:
# splitting dataset
X_train_kBest, X_test_kBest, y_train_kBest, y_test_kBest = train_test_split(selectedFeatures, y_dup, test_size=0.2, random_state=42) 

In [218]:
svm_malware_tuned = SVC(C = 10, kernel = 'rbf',gamma = 0.1)
svm_malware_tuned.fit(X_train_kBest, y_train_kBest)
pred_tuned = svm_malware_tuned.predict(X_test_kBest)
acc_tuned = metrics.accuracy_score(pred_tuned, y_test_kBest)*100
print("Accuracy after kBest selected features:" + str(acc_tuned)+ "%")

Accuracy after kBest selected features:95.0465425531915%


Feature selection with the kBest algorithm reduces the accuracy of the selected model, thus reinforcing previous findings that all features contribute positively to prediction.

# Feature selection using Lasso

Lasso is a regression algorithm that stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of models and can be used to select features. Shrinkage is where data value is shrunk towards the central point(mean). The regression model uses L1 Regularization, it adds a penalty that is equal to the absolute value of the magnitude of the coefficient. If two features are correlated their simultaneous presence will increase the cost of the function, and Lasso regression will shrink the values of less important features to 0, to select the best ones.

#### Lasso model

In [100]:
lasso_model = Lasso()

Before choosing the feature, it is advisable to tune the hyper-parameters of the Lasso algorithm. In this case, it is the α value, which is the constant multiplier in L1 Regularization.

#### Defining parameter grid

In [129]:
parameter_grid_lasso = {'alpha': np.arange(0.1,10,0.1)}

#### Halving grid search for best parameter

In [161]:
lasso_hyper = HalvingGridSearchCV(lasso_model,parameter_grid_lasso, min_resources="exhaust",cv = 5, scoring="neg_mean_squared_error")
search_results = lasso_hyper.fit(X_dup,y_dup)
print('Best parameter value is: '+ str(search_results.best_params_))

Best parameter value is: {'alpha': 0.1}


#### Logistic Regression feature selection

In [219]:
#L1 for Lasso
from sklearn.linear_model import LogisticRegression 
logreg = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
logreg.fit(X_train_dup, y_train_dup)
eli5.show_weights(logreg, top=-1, feature_names = X_train_dup.columns.tolist())

Weight?,Feature
2.798,SEND_SMS
2.797,android.telephony.gsm.SmsManager
1.616,INTERNET
1.436,android.telephony.SmsManager
1.27,chmod
1.207,READ_HISTORY_BOOKMARKS
1.135,TelephonyManager.getDeviceId
1.113,TelephonyManager.getSubscriberId
1.096,Ljava.lang.Class.getResource
0.935,Runtime.exec


#### Data transformation

All features with coeff greater than 0 will be taken into account during prediction(32 features have  coeff > 0).

In [211]:
coef_df=pd.DataFrame(logreg.coef_,columns=X_train.columns).T.reset_index()
coef_df.columns =['Feature', 'Coef']
coef_df_sorted = coef_df[coef_df.Coef > 0]
Selected_Features = X_dup[coef_df_sorted['Feature']]

#### Prediction of Lasso selected features

In [213]:
# splitting dataset
X_train_lasso, X_test_lasso, y_train_lasso, y_test_lasso = train_test_split(Selected_Features, y_dup, test_size=0.2, random_state=42) 

In [214]:
svm_malware_tuned = SVC(C = 10, kernel = 'rbf',gamma = 0.1)
svm_malware_tuned.fit(X_train_lasso, y_train_lasso)
pred_tuned = svm_malware_tuned.predict(X_test_lasso)
acc_tuned = metrics.accuracy_score(pred_tuned, y_test_lasso)*100
print("Accuracy after lasso selected features:" + str(acc_tuned)+ "%")

Accuracy after lasso selected features:97.00797872340425%


Logistic Regression is a valuable tool that can be used in feature selection. However, given the nature of this dataset, previous findings, and reduction of accuracy(however much smaller penalty than kBest) one's again the conclusion is that reducing the number of features is not advisable as it reduces the accuracy of the prediction.

# References

1. Vijayanand. C. D, Arunlal. K. S. (2019). Impact of Malware in Modern Society. International Journal of Scientific Research and Engineering Development, 2(3), 593–600.

2. Arp, D., Spreitzenbarth, M., Hübner, M., Gascon, H., & Rieck, K. (2014). Drebin: Effective and Explainable Detection of Android Malware in Your Pocket. Proceedings 2014 Network and Distributed System Security Symposium. Network and Distributed System Security Symposium, San Diego, CA. https://doi.org/10.14722/ndss.2014.23247

3. Lameski, P., Zdravevski, E., Mingov, R., & Kulakov, A. (2015). SVM Parameter Tuning with Grid Search and Its Impact on Reduction of Model Over-fitting. In Y. Yao, Q. Hu, H. Yu, & J. W. Grzymala-Busse (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (Vol. 9437, pp. 464–474). Springer International Publishing. https://doi.org/10.1007/978-3-319-25783-9_41