<a href="https://colab.research.google.com/github/DanielFadlon/LWSM/blob/main/LWSM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **LWSM (Light-Weight Signature Matching)**
A framework for a resource-limited service that combines a PDS (Probabilistic Data Structure) with AI-powered methods.

In [1]:
path_to_project = 'drive/MyDrive/AIForCyber/LWSM'
# Define the ember2018 directory path
features_path = f'{path_to_project}/ember2018'
# Define a path for caching the vectorized features (so they are not recreated on every run)
vectorized_path = f'{path_to_project}/ember_vectorized.npy'

# Define the given initial blacklist file path
file_path_to_malicous_sha256_ember = f'{path_to_project}/task1_malicous_sha256_ember.txt'

In [2]:
!pip install git+https://github.com/zivido/ember-3.9.git
!pip install lief
!pip install bitarray

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/zivido/ember-3.9.git
  Cloning https://github.com/zivido/ember-3.9.git to /tmp/pip-req-build-h42a7rq4
  Running command git clone --filter=blob:none --quiet https://github.com/zivido/ember-3.9.git /tmp/pip-req-build-h42a7rq4
  Resolved https://github.com/zivido/ember-3.9.git to commit bc8fe8fe990ed172c69320d13b757b384b8cc367
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ember
  Building wheel for ember (setup.py) ... done
  Created wheel for ember: filename=ember-0.1.0-py3-none-any.whl size=13096 sha256=f7d254ef7200b20eb6cfcb3ebc26a0d86497359d5b077e4265a42a31b8e6c470
  Stored in directory: /tmp/pip-ephem-wheel-cache-jholkkjq/wheels/8a/de/4f/e2eadc2237ea0e00160b8a5b067e4e2d8ef94f361258d38dcf
Successfully built ember
Installing collected packages: ember
Successfully installed ember-0.1.0
Looking in 

In [3]:
# Imports
import ember
import math
import numpy as np
import hashlib
import bitarray

In [4]:
# using drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **EMBER Data Set**

- Investigate the data set.
- Load the vectorized EMBER features, creating them on the first run and caching them on disk for later runs.

In [None]:
# Check if the vectorized features have already been saved
try:
    X_vectorized = np.load(vectorized_path)
    print('Loaded vectorized features from file')
except FileNotFoundError:
    # If the file doesn't exist, create the vectorized features and save them to a file
    X_vectorized = ember.create_vectorized_features(features_path)
    np.save(vectorized_path, X_vectorized)
    print('Created and saved vectorized features to file')

Vectorizing training set


  4%|▍         | 34716/800000 [56:42<2:07:28, 100.05it/s]

## **PDS**

First, we create our Probabilistic Data Structure, which maps SHA256 codes to malicious or benign such that: \
1. Every malicious file is mapped as malicious. \
    **FNR (False Negative Rate) = 0**
2. The number of benign files that are mapped as malicious is at most 0.01% of the number of all files. \
    **FPR (False Positive Rate) <= 0.01%**
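The two requirements can be checked empirically on labelled hashes. The snippet below is a minimal sketch: `error_rates`, the `flagged` set, and the short hash strings are hypothetical stand-ins, and the FPR is computed over the benign files only (the usual convention, assumed here).

```python
def error_rates(is_flagged, malicious, benign):
    """Return (FNR, FPR) of a membership predicate on labelled hashes."""
    missed = sum(1 for h in malicious if not is_flagged(h))
    false_alarms = sum(1 for h in benign if is_flagged(h))
    return missed / len(malicious), false_alarms / len(benign)


# Hypothetical data: the structure flags both malicious hashes
# and, by mistake, one benign hash ("cc33").
flagged = {"aa11", "bb22", "cc33"}
malicious = ["aa11", "bb22"]
benign = ["cc33", "dd44", "ee55", "ff66"]

fnr, fpr = error_rates(lambda h: h in flagged, malicious, benign)
print("FNR:", fnr)
print("FPR:", fpr)
```

A structure meets the requirements above when `fnr` is exactly 0 and `fpr` stays at or below 0.0001.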

#### **Bloom Filter**

We use the Bloom filter data structure to check efficiently whether a file is in the set of malicious files.

A Bloom filter is a probabilistic data structure that is used to efficiently test whether an element is a member of a set. It uses a bit array and a set of hash functions to represent the set of elements and to determine whether an element is likely to be a member of the set.

Bloom filters have several advantages over other data structures such as hash tables or binary search trees.
- They are very space-efficient: they store only a bit array, not the elements themselves.
- They are fast: an insertion or lookup costs one evaluation per hash function, independent of the number of stored elements.
- They can handle large sets of elements with a low probability of false positives.

Bloom filters are particularly useful in cases where the cost of false positives is low (i.e., it is acceptable to occasionally report a benign file as malicious), but the cost of false negatives is high (i.e., it is not acceptable to miss a malicious file).
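For reference, the standard sizing rules for a Bloom filter can be written as follows. The symbols are notation introduced here: $n$ is the number of inserted elements, $m$ the number of bits, $k$ the number of hash functions, and $p$ the target false positive rate.

$$
m = -\frac{n \ln p}{(\ln 2)^2}, \qquad
k = \frac{m}{n} \ln 2, \qquad
p \approx \left(1 - e^{-kn/m}\right)^{k}
$$

For $p = 0.0001$ this gives roughly $m/n \approx 19.17$ bits per element and $k \approx 13.29$, which in practice is turned into a whole number of hash functions.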

In [None]:
class BloomFilter:
    def __init__(self, n, fpr):
        """
        Inputs:
         n - number of elements expected to be inserted into the Bloom filter
         fpr - the required False Positive Rate
        """
        bitarray_size, hash_count = self.find_optimal_size(n, fpr)
        self.n = n
        self.size = bitarray_size
        self.hash_count = hash_count
        self.bitarray = bitarray.bitarray(bitarray_size)
        self.bitarray.setall(0)

    def _indexes(self, item):
        # Derive hash_count bit positions by salting SHA-256 with the hash number i
        for i in range(self.hash_count):
            digest = hashlib.sha256(str(item).encode('utf-8') + str(i).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for index in self._indexes(item):
            self.bitarray[index] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bitarray[index] for index in self._indexes(item))

    def get_size(self):
        return self.size

    def get_hash_count(self):
        return self.hash_count

    def find_optimal_size(self, n, fpr):
        """
        Compute the optimal size of the bit array and the number of hash functions
        needed for a Bloom filter that stores n items with a desired false positive rate of fpr.
        """
        bitarray_size = int(-n * math.log(fpr) / math.log(2) ** 2) + 1
        # Use at least one hash function, even for very loose fpr targets
        number_of_hash_functions = max(1, int((bitarray_size / n) * math.log(2)))
        return bitarray_size, number_of_hash_functions

    def __str__(self):
        # Theoretical false positive rate once n elements have been inserted
        expected_fpr = (1 - math.exp(-self.hash_count * self.n / self.size)) ** self.hash_count
        return (f"Bloom filter size: {self.size} \n"
                f"False Positive Rate: {expected_fpr} \n"
                f"Number of Hash functions: {self.hash_count}")
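The behaviour described above (no false negatives, a small rate of false positives) can be observed with a small experiment. This is a minimal sketch, not the notebook's class: `TinyBloom` is a hypothetical stdlib-only stand-in that uses a `bytearray` instead of `bitarray`, and the test hashes are synthetic.

```python
import hashlib
import math


class TinyBloom:
    """Hypothetical stdlib-only Bloom filter (bytearray instead of bitarray)."""

    def __init__(self, n, fpr):
        # Standard sizing formulas for n elements and target rate fpr
        self.size = int(-n * math.log(fpr) / math.log(2) ** 2) + 1
        self.hash_count = max(1, int(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _indexes(self, item):
        # One bit position per hash function: SHA-256 salted with the hash number
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{item}{i}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item))


# Synthetic stand-ins for SHA256 strings of malicious and benign files
members = [hashlib.sha256(f"malicious-{i}".encode()).hexdigest() for i in range(1000)]
outsiders = [hashlib.sha256(f"benign-{i}".encode()).hexdigest() for i in range(5000)]

bf = TinyBloom(n=len(members), fpr=0.01)
for h in members:
    bf.add(h)

false_negatives = sum(1 for h in members if h not in bf)
false_positives = sum(1 for h in outsiders if h in bf)
print("false negatives:", false_negatives)  # 0: every inserted item sets all of its bits
print("observed FPR:", false_positives / len(outsiders))
```

Every inserted element is always found, because `add` sets exactly the bits that `__contains__` later checks; the observed false positive rate should land near the requested 1%.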

## Malicious SHA256 for EMBER

Read the SHA256 hashes from the initial blacklist file (`file_path_to_malicous_sha256_ember`) and insert them into our Bloom filter.

In [70]:
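fpr = 0.0001
with open(file_path_to_malicous_sha256_ember, 'r') as f:
    # Read the file once: a file object is exhausted after a single pass, so
    # counting the lines and then looping over f again would insert nothing.
    # Strip the trailing newline so lookups by the bare hash string succeed.
    hash_codes = [line.strip() for line in f if line.strip()]

n = len(hash_codes)
bloom_filter = BloomFilter(n, fpr)
for hash_code in hash_codes:
    bloom_filter.add(hash_code)
print(bloom_filter)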

Bloom filter size: 7668047 
False Positive Rate: 0.00010013457033664568 
Number of Hash functions: 13
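The printed numbers can be reproduced from the sizing formulas alone. This is a sketch under one assumption: the blacklist's line count is not shown in the notebook, so `n = 400_000` is a value chosen here because it reproduces the printed size. `bloom_parameters` and `expected_fpr` are illustrative helper names.

```python
import math


def bloom_parameters(n, fpr):
    """Bit-array size and hash count for n items and target rate fpr."""
    size = int(-n * math.log(fpr) / math.log(2) ** 2) + 1
    hash_count = int(size / n * math.log(2))
    return size, hash_count


def expected_fpr(n, size, hash_count):
    """Theoretical false positive rate after n insertions."""
    return (1 - math.exp(-hash_count * n / size)) ** hash_count


n = 400_000  # assumption: not stated in the notebook, inferred from the printed size
size, hash_count = bloom_parameters(n, 0.0001)
bytes_needed = (size + 7) // 8  # bits rounded up to whole bytes

print("size in bits:", size)
print("hash functions:", hash_count)
print("expected FPR:", expected_fpr(n, size, hash_count))
print("memory in KiB:", bytes_needed / 1024)
```

Under that assumption the whole blacklist fits in under 1 MiB of bits, which is the point of using a Bloom filter for a resource-limited service. The expected rate sits marginally above the 0.0001 target because the hash count is truncated to a whole number.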


## **AI Powered Methods**

In the following part, we investigate the features and try to build the best possible model for the client's needs.

There are three main issues that we are going to consider:
1. The importance 