<a href="https://colab.research.google.com/github/BMG2-Dev/GitWorkshop/blob/main/Modifiable_ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Malware File Detection Using Machine Learning

##Imported Libraries

In [None]:
import pandas as pd # Data processing and analysis
from google.colab import drive # Access to data stored in Google Drive
import json # Reading JSON files
from sklearn.preprocessing import OneHotEncoder, StandardScaler # Categorical encoding and numerical scaling
from sklearn.compose import ColumnTransformer # Applies different preprocessing to different columns
from sklearn.pipeline import Pipeline # Combines preprocessing and modeling into one pipeline
from sklearn.ensemble import RandomForestClassifier # Classification model
from sklearn.metrics import classification_report # Model performance evaluation
import matplotlib.pyplot as plt # Visualizations

##Mount Google Drive

In [None]:
# Mount Google Drive so that its files are accessible under '/content/drive'.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##Read JSON Files and Format Them Into DataFrames

In [None]:
# Reads a .json file and formats its records into a DataFrame
def read_and_format_attributes(file):
  # Open the JSON file and load its contents (a list of file records)
  with open(file, "r") as f:
    data = json.load(f)

  # List that collects one row per file record
  rows = []

  # Loop through the file records
  for feature in data:
    row = dict(feature['attributes'])   # Copy the record's metadata 'attributes' (leaves the loaded data unmodified)
    row['label'] = feature['label']     # Add the record's class label to its row
    rows.append(row)                    # Append the row to the rows list
  df = pd.DataFrame(rows)               # Build a DataFrame with one row per file record

  return df

# Load datasets
df1 = read_and_format_attributes('/content/drive/MyDrive/AI_courses/ML/train.json')
df2 = read_and_format_attributes('/content/drive/MyDrive/AI_courses/ML/val.json')
df3 = read_and_format_attributes('/content/drive/MyDrive/AI_courses/ML/test.json')
df1.head()

##Initialize The Training, Validation, and Test DataFrames

In [None]:
# Separate features and labels for training, validation, and testing sets
X_train = df1.drop(columns=['label'])    # Training features (all columns except 'label')
y_train = df1['label']                   # Training target labels

X_vali = df2.drop(columns=['label'])     # Validation features (all columns except 'label')
y_vali = df2['label']                    # Validation target labels

X_test = df3.drop(columns=['label'])     # Test features (all columns except 'label')
y_test = df3['label']                    # Test target labels

# Define categorical, boolean, and numeric features
categoric_features = ['extension', 'entropy']    # Categorical features to be one-hot encoded
numeric_features = ['created_minutes_ago']       # Numeric feature to be scaled
boolean_features = ['double_extension', 'executable', 'hidden', 'system', 'hash_in_malware_db'] # Boolean features passed through unchanged

# Preprocessing pipeline
preprocessing = ColumnTransformer( transformers=[('categoric', OneHotEncoder(handle_unknown='ignore'), categoric_features),
                                                ('numeric', StandardScaler(), numeric_features),
                                                ('boolean', 'passthrough', boolean_features)] )

##Preprocessing and Random Forest Pipeline

In [None]:
# Creates pipeline with preprocessing and classifier combination
pc_pipeline = Pipeline( steps=[('preprocessing', preprocessing),
                                ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))] )

# Train model on training data
pc_pipeline.fit(X_train, y_train)

# Validation Set Evaluation
vali_predi = pc_pipeline.predict(X_vali)
vali_results = classification_report(y_vali, vali_predi, output_dict=True)
vali_results_df = pd.DataFrame(vali_results).transpose()

# Test Set Evaluation
test_predi = pc_pipeline.predict(X_test)
test_results = classification_report(y_test, test_predi, output_dict=True)
test_results_df = pd.DataFrame(test_results).transpose()

##Evaluate Model Performance

In [None]:
print("Validation Results:")
print(vali_results_df)

print("\nTest Results:")
print(test_results_df)

Validation Results:
              precision    recall  f1-score      support
Benign         0.762850  0.837179  0.798289   780.000000
Ransomware     0.258824  0.253846  0.256311   260.000000
Spyware        0.267782  0.246154  0.256513   260.000000
Trojan         0.252381  0.203846  0.225532   260.000000
accuracy       0.535897  0.535897  0.535897     0.535897
macro avg      0.385459  0.385256  0.384161  1560.000000
weighted avg   0.511256  0.535897  0.522204  1560.000000

Test Results:
              precision    recall  f1-score      support
Benign         0.755725  0.846154  0.798387   585.000000
Ransomware     0.268156  0.246154  0.256684   195.000000
Spyware        0.262295  0.246154  0.253968   195.000000
Trojan         0.274510  0.215385  0.241379   195.000000
accuracy       0.541026  0.541026  0.541026     0.541026
macro avg      0.390172  0.388462  0.387605  1170.000000
weighted avg   0.512023  0.541026  0.524532  1170.000000


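For clean files, the model performs reasonably well: on the test set, the Benign class reaches a precision of about 0.76 and a recall of about 0.85 (F1 about 0.80), and the validation figures are nearly identical.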

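For the three malware types (Ransomware, Spyware, Trojan), performance is poor: precision, recall, and F1 all fall between roughly 0.20 and 0.27 on both sets, and overall accuracy is only about 0.54. Low recall means that most files of a given malware type are not labeled as that type. Low precision means that most files predicted as a given malware type actually belong to another class. The low F1 score, which combines precision and recall, confirms that the model has trouble telling the malware types apart.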
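To make the per-class metrics concrete, here is a minimal sketch of how precision, recall, and F1 follow from a confusion matrix. The `y_true` and `y_pred` lists are hypothetical toy labels chosen for illustration, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["Benign", "Ransomware", "Spyware", "Trojan"]

# Hypothetical toy labels (NOT taken from the notebook's dataset)
y_true = ["Benign", "Benign", "Benign", "Benign",
          "Ransomware", "Ransomware", "Spyware", "Spyware", "Trojan", "Trojan"]
y_pred = ["Benign", "Benign", "Benign", "Trojan",
          "Ransomware", "Spyware", "Ransomware", "Spyware", "Benign", "Trojan"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall = true_positives / cm.sum(axis=1)      # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn's own per-class computation
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:<12} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same decomposition applies to the notebook's predictions, for example by passing `y_test` and `test_predi` to `confusion_matrix`, which also shows which classes the malware files are being confused with.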

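These results could stem from class imbalance: in the validation and test sets, Benign has three times as many samples as each malware type (780 versus 260, and 585 versus 195), and the training set presumably has a similar distribution. Alternatively, the available file features may not carry enough information to separate the malware types from one another.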
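The class shares and a majority-class baseline can be checked directly from the support column of the validation report. The sketch below copies those four support counts by hand; the variable names are illustrative.

```python
import pandas as pd

# Per-class support copied from the validation report
vali_support = pd.Series(
    {"Benign": 780, "Ransomware": 260, "Spyware": 260, "Trojan": 260}
)

# Fraction of the validation set that belongs to each class
class_share = vali_support / vali_support.sum()

# Accuracy of a trivial model that always predicts the most common class
majority_baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Benign makes up half of the validation set, so always predicting Benign would already score 0.5 accuracy. The model's validation accuracy of about 0.536 is only slightly above that baseline. In the notebook itself, `y_train.value_counts(normalize=True)` gives the same breakdown for the training labels.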

Possible remedies are to oversample the malware classes so that the training data is balanced, to assign higher class weights to the malware types, or to add more distinguishing file features so that the model can separate the classes more easily.
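A minimal sketch of the first two remedies follows. It assumes a synthetic stand-in dataset (`toy_train` with hypothetical columns `feature_a` and `feature_b`) rather than the notebook's real data, so it shows the mechanics only, not the effect on the scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Hypothetical stand-in for the training data, with a 3:1:1:1 class ratio
toy_labels = ["Benign"] * 300 + ["Ransomware"] * 100 + ["Spyware"] * 100 + ["Trojan"] * 100
toy_train = pd.DataFrame({
    "feature_a": rng.normal(size=len(toy_labels)),
    "feature_b": rng.normal(size=len(toy_labels)),
    "label": toy_labels,
})
feature_columns = ["feature_a", "feature_b"]

# Option 1: class weighting. 'balanced' weights each class inversely
# to its frequency, so errors on rare classes cost more during training.
weighted_model = RandomForestClassifier(
    n_estimators=100, random_state=42, class_weight="balanced"
)
weighted_model.fit(toy_train[feature_columns], toy_train["label"])

# Option 2: random oversampling. Each class is resampled with replacement
# up to the size of the largest class. Apply this to training data only.
majority_size = toy_train["label"].value_counts().max()
balanced_parts = [
    resample(group, replace=True, n_samples=majority_size, random_state=42)
    for _, group in toy_train.groupby("label")
]
balanced_train = pd.concat(balanced_parts, ignore_index=True)

print(balanced_train["label"].value_counts())
```

In this notebook, the class-weight option would amount to passing `class_weight="balanced"` to the `RandomForestClassifier` in the `'classifier'` step of `pc_pipeline`. Neither option helps if the features themselves cannot separate the malware types, which is why adding more informative features remains the third remedy.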
