# HW-5: Malware Classification (Due 5th January, 2023)

**Instructions:**

Suppose your company is struggling with a series of computer virus attacks for the past several months. The viruses were grouped into a few types with some effort. However, it takes a long time to sort out what kind of virus it is when been hit with. Thus, as a senior IT department member, you undertook a project to classify the virus as quickly as possible. You've been given a dataset of the features that may be handy (or not), and  also the associated virus type (target variable). 

You are supposed to try different classification methods and apply best practices we have seen in the lectures such as grid search, cross validation, regularization etc. To increase your grade you can add more elaboration such as using ensembling or exploiting feature selection/extraction techniques. **An evaluation rubric is provided.**

Please prepare a python notebook that describes the steps, present the results as well as your comments. 

You can download the data (csv file) [here](https://drive.google.com/file/d/1yxbibzUU8bjOyChDVFPfQ4viLduYdk29/view?usp=sharing).


In [25]:
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import cross_val_score



In [23]:
df = pd.read_csv('hw5.csv')

X = df.drop("target", axis=1)
y = df['target']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)#Split the data as train and test


In [40]:
from sklearn import svm


#Hyperparameters set

dt_parameters = {
    'criterion':['gini','entropy'],
    'max_depth': [2, 4, 6],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': range(1,5)
}

svm_parameters =  {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],}#When I include Kernel parameter run time extremely increase

rf_parameters = { 
    
    'max_features': ['auto'],
    'max_depth' : [4,5,6,7,],
    'criterion' :['gini', 'entropy']
}


knn_parameters = {
    'n_neighbors' : [3,5,11,19],
    'weights' : ['uniform', 'distance'],
    'metric' : ['euclidean', 'manhattan']
}





dt = tree.DecisionTreeClassifier()
knn = KNeighborsClassifier()
svm = svm.SVC()
rf = RandomForestClassifier()
nb = GaussianNB()
lr = LogisticRegression(max_iter=100000)



dt_model = GridSearchCV(dt, dt_parameters, n_jobs=-1)
knn_model = GridSearchCV(knn, knn_parameters, n_jobs=-1)
svm_model = GridSearchCV(svm, svm_parameters, n_jobs=-1)
rf_model = GridSearchCV(rf, rf_parameters, n_jobs=-1)





ensemble = VotingClassifier(estimators=[('dt', dt_model), ('knn', knn_model), ('rf', rf_model), ('nb', nb)], voting='hard')#Ensembling applied with the algorithms

selected_features= []#Features selected with different reduced sizes.

for n in [25, 50, 100, 200]:
    
    selector = SelectKBest(f_classif, k=n)

    selector.fit(X_train, y_train)

    X_train_selected = selector.transform(X_train)
    
    X_test_selected = selector.transform(X_test)

    selected_features.append([X_train_selected, X_test_selected])


lr.fit(selected_features[0][0], y_train)

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=100000)

In [35]:
#K-fold cross validation applied
scores_knn = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
print("Knn cross validation score :",scores_knn.max())

scores_rf = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print("Random Forest cross validation score :", scores_rf.max())

scores_svm = cross_val_score(svm, X_train, y_train, cv=3, scoring='accuracy')#I choose k = 3 because of runtime
print("Svm cross validation score :", scores_svm.max())

scores_dt = cross_val_score(dt, X_train, y_train, cv=5, scoring='accuracy')
print("Decision Tree cross validation score :", scores_dt.max())




Knn cross validation score : 0.765
Random Forest cross validation score : 0.9148936170212766
Svm cross validation score : 0.36534133533383345
Decision Tree cross validation score : 0.80875


In [45]:
size = [25, 50, 100, 200]
n = 0
for selected in selected_features:


    dt_model.fit(selected[0], y_train)
    dt_predictions = dt_model.predict(selected[1])
    print("Decision Tree n=",size[n],"Score :", accuracy_score(y_test, dt_predictions))
    # print best parameter after tuning
    print(dt_model.best_params_)


    knn_model.fit(selected[0], y_train)
    knn_predictions = knn_model.predict(selected[1])
    print("KNN n=",size[n], "Score :", accuracy_score(y_test, knn_predictions))
    # print best parameter after tuning
    print(knn_model.best_params_)  


    svm_model.fit(selected[0], y_train)
    svm_predictions = svm_model.predict(selected[1])
    print("SVM n=",size[n], "Score :", accuracy_score(y_test, svm_predictions))
    # print best parameter after tuning
    print(svm_model.best_params_)


    rf_model.fit(selected[0], y_train)
    rf_predictions = rf_model.predict(selected[1])
    print("Random Forest n=",size[n], "Score :", accuracy_score(y_test, rf_predictions))
    # print best parameter after tuning
    print(rf_model.best_params_)

    nb.fit(selected[0], y_train)
    nb_predictions = nb.predict(selected[1])
    print("Naive Bayes n=",size[n], "Score :", accuracy_score(y_test, nb_predictions))
    

    
    lr_predictions = lr.predict(selected[1])
    print("Logistic Regression n=",size[n], "Score :", accuracy_score(y_test, lr_predictions))

    ensemble.fit(selected[0], y_train)
    ensemble_predictions = ensemble.predict(selected[1])
    print("Ensemble n=",size[n], "Score :", accuracy_score(y_test, ensemble_predictions))

    print("\n")
    n+=1


Decision Tree n= 25 Score : 0.836
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 4}
KNN n= 25 Score : 0.858
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance'}
SVM n= 25 Score : 0.278
{'C': 0.1, 'gamma': 1}
Random Forest n= 25 Score : 0.89
{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto'}
Naive Bayes n= 25 Score : 0.66
Logistic Regression n= 25 Score : 0.716


STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Ensemble n= 25 Score : 0.876


Decision Tree n= 50 Score : 0.847
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 4, 'min_samples_split': 2}
KNN n= 50 Score : 0.869
{'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}
SVM n= 50 Score : 0.278
{'C': 0.1, 'gamma': 1}
Random Forest n= 50 Score : 0.9
{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto'}
Naive Bayes n= 50 Score : 0.62


ValueError: ignored

# Hyperparameter search and scores

Decision Tree n= 25 Score : 0.836
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 4}

KNN n= 25 Score : 0.858
{'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance'}

SVM n= 25 Score : 0.278
{'C': 0.1, 'gamma': 1}

Random Forest n= 25 Score : 0.89
{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto'}

Naive Bayes n= 25 Score : 0.66

Logistic Regression n= 25 Score : 0.716

Ensemble n= 25 Score : 0.876



I get the highest score with using Random Forest .