## 1. Data Loading and Initial Preprocessing

In [65]:
import os  # Interacting with the operating system (listing folders, joining paths)
import pandas as pd  # Building and inspecting DataFrames
import numpy as np  # Numerical operations
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts a collection of raw documents to a matrix of TF-IDF features
from sklearn.preprocessing import LabelEncoder  # Converts categorical labels into integers
from sklearn.model_selection import train_test_split  # Splits data into training and testing sets
from sklearn.ensemble import RandomForestClassifier  # Ensemble classifier built from decision trees
from sklearn.metrics import classification_report, accuracy_score  # Evaluation metrics


In [66]:
# Define the path to the input files folder
data_dir = '/Users/dianaterraza/Desktop/data_scientist_yara_project/new_input_files'

# List to store the data
data = []

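Backdoor.txt, Dialer.txt, Behavior.txt, Adware.txt, BrowserModifier.txt, and Constructor.txt cannot be read using the utf-8 encoding. The most likely cause is that these files are saved as UTF-16: the loaded content begins with ÿþ, which is how the UTF-16 little-endian byte-order mark (bytes 0xFF 0xFE) appears when decoded as latin-1, and every letter is followed by a NUL character. The latin-1 fallback used in the next cell never raises an error, but it does not decode these files correctly.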

In [67]:
# Loop through each file in the folder
for filename in os.listdir(data_dir):
    if filename.endswith('.txt') or filename.endswith('.csv'):
        file_path = os.path.join(data_dir, filename)
        label = os.path.splitext(filename)[0]
        
        try:
            # Try reading with utf-8 first 
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            data.append({'content': content, 'label': label})
            
        except UnicodeDecodeError:
            # If utf-8 fails, fall back to 'latin-1', which maps every byte to a character and so never raises a decode error
            try:
                with open(file_path, 'r', encoding='latin-1') as f:
                    content = f.read()
                data.append({'content': content, 'label': label})
                print(f"Successfully read {filename} with latin-1 encoding.")
            except Exception as e:
                print(f"Could not read {filename} with latin-1: {e}")
        
        except Exception as e:
            print(f"Error reading file {filename}: {e}")

Successfully read Backdoor.txt with latin-1 encoding.
Successfully read Dialer.txt with latin-1 encoding.
Successfully read Behavior.txt with latin-1 encoding.
Successfully read BrowserModifier.txt with latin-1 encoding.
Successfully read Adware.txt with latin-1 encoding.
Successfully read Constructor.txt with latin-1 encoding.


In [68]:
# Create a pandas DataFrame from the list
df = pd.DataFrame(data)

In [69]:
df.head(20)

   content                                             label
0  ÿþW o r m : W i n 3 2 / W o o t b o t \n \n B ...   Backdoor
1  ÿþD i a l e r : W i n 3 2 / A c o n t i \n \n ...   Dialer
2  ÿþB e h a v i o r : W i n 3 2 / M o d i f i e ...   Behavior
3  ÿþB r o w s e r M o d i f i e r : W i n 3 2 / ...   BrowserModifier
4  ÿþA d w a r e : W i n 3 2 / A d R o t a t o r ...   Adware
5  signature,metadata_category,metadata_comment\n...   Microsoft_Defender_All_signatures_list
6  ÿþC o n s t r u c t o r : W i n 3 2 / S e v e ...   Constructor
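The ÿþ prefix and the gaps between letters in the rows above suggest UTF-16 text that was decoded as latin-1. The following is a minimal sketch of reading a file by checking for a byte-order mark first. The helper name `read_text_detecting_bom` and the temporary sample file are assumptions made for illustration; they are not part of the original notebook.

```python
import codecs
import os
import tempfile

# Byte-order marks to check. UTF-32 comes first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def read_text_detecting_bom(path, fallback="latin-1"):
    """Hypothetical helper: decode a file using its BOM when one is present."""
    with open(path, "rb") as f:
        raw = f.read()
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order.
            return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Last resort: latin-1 never fails, but may produce wrong characters.
        return raw.decode(fallback)


# Demonstration on a temporary UTF-16 LE file (sample text, not the real data).
sample = "Worm:Win32/Wootbot\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

    decoded = read_text_detecting_bom(path)
    with open(path, "r", encoding="latin-1") as f:
        as_latin1 = f.read()

print(repr(decoded))
print(repr(as_latin1[:8]))
```

Comparing the two printed values shows the difference: the BOM-aware read returns the plain signature name, while the latin-1 read keeps the BOM characters and a NUL after every letter.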


In [70]:
# Display the first few rows of the DataFrame
print(df.head())
print(df.info())

                                             content            label
0  ÿþW o r m : W i n 3 2 / W o o t b o t \n \n B ...         Backdoor
1  ÿþD i a l e r : W i n 3 2 / A c o n t i \n \n ...           Dialer
2  ÿþB e h a v i o r : W i n 3 2 / M o d i f i e ...         Behavior
3  ÿþB r o w s e r M o d i f i e r : W i n 3 2 / ...  BrowserModifier
4  ÿþA d w a r e : W i n 3 2 / A d R o t a t o r ...           Adware
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  7 non-null      object
 1   label    7 non-null      object
dtypes: object(2)
memory usage: 244.0+ bytes
None


## 2. Clean the Data

* Convert to lowercase: So that "Adware" and "adware" are treated as the same word.

* Remove non-letter characters: To get rid of punctuation, digits, and symbols that don't add value.

* Remove extra whitespace: To ensure consistent word separation.

In [71]:
import re  # Standard-library module for regular expressions, used here for text cleaning

def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Delete every character that is not a lowercase letter or whitespace (nothing is inserted in its place)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'content' column of the DataFrame
df['cleaned_content'] = df['content'].apply(clean_text)

# Display the first few rows to verify the changes
print(df[['content', 'cleaned_content']].head())

                                             content  \
0  ÿþW o r m : W i n 3 2 / W o o t b o t \n \n B ...   
1  ÿþD i a l e r : W i n 3 2 / A c o n t i \n \n ...   
2  ÿþB e h a v i o r : W i n 3 2 / M o d i f i e ...   
3  ÿþB r o w s e r M o d i f i e r : W i n 3 2 / ...   
4  ÿþA d w a r e : W i n 3 2 / A d R o t a t o r ...   

                                     cleaned_content  
0  wormwinwootbot backdoorwinhupigon backdoorwina...  
1  dialerwinaconti dialerwinactivestripplayer dia...  
2  behaviorwinmodifiedautoruninf behaviorwindropp...  
3  browsermodifierwinadvsearch browsermodifierwin...  
4  adwarewinadrotator adwarewinbonzibuddy adwarew...  
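Because `clean_text` deletes punctuation and digits outright, the parts of each signature name are fused into one token (`wormwinwootbot`). The sketch below shows a variant that substitutes a space instead, so the parts stay separate. It assumes the text has already been decoded correctly (no NUL character between letters); the function name `clean_text_keep_tokens` and the sample string are illustrative only.

```python
import re


def clean_text_delete(text):
    """Same logic as clean_text above: non-letters are deleted."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


def clean_text_keep_tokens(text):
    """Hypothetical variant: non-letters become spaces, so tokens stay apart."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


# Sample built from two signature names visible in the output above.
sample = "Worm:Win32/Wootbot\n\nBackdoor:Win32/Hupigon"

fused = clean_text_delete(sample)
separated = clean_text_keep_tokens(sample)

print(fused)      # wormwinwootbot backdoorwinhupigon
print(separated)  # worm win wootbot backdoor win hupigon
```

Which variant is preferable depends on the goal: fused tokens make every signature a unique vocabulary entry, while separated tokens let signatures share terms such as the family name.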


## 3. Feature Engineering

TF-IDF (Term Frequency-Inverse Document Frequency). This technique converts clean text into a numerical representation that machine learning models can understand. It helps identify the most important words in each document, giving them greater weight if they are relevant to a document but rare in the rest of the corpus.

This process is done in two steps:

### 3.1 Vectorization with TF-IDF
Use the scikit-learn TfidfVectorizer class to transform the cleaned text column (cleaned_content) into a feature matrix.

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer instance
# min_df=1 (the default) keeps every term that appears in at least one document
vectorizer = TfidfVectorizer(min_df=1)

# Fit the vectorizer to your data and transform it into a matrix
X = vectorizer.fit_transform(df['cleaned_content'])

print("Dimensions of the feature matrix:", X.shape)

Dimensions of the feature matrix: (7, 335844)


Code explanation: 

* TfidfVectorizer(): This class not only counts how often each word occurs but also weights each count by how rare the word is across the corpus.

* X = vectorizer.fit_transform(...): This method learns the vocabulary from your text and then transforms that text into a sparse matrix of TF-IDF values.
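To make the weighting concrete, here is a small pure-Python sketch of the calculation that TfidfVectorizer performs with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplification, since it splits on whitespace rather than using the vectorizer's own tokenizer, and the three-document corpus is a toy example rather than the project data.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Illustrative TF-IDF: returns the sorted vocabulary and one dense row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so every non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Toy corpus: "adware" occurs in two documents, the other terms in one each.
corpus = ["adware adrotator", "adware bonzibuddy", "dialer aconti"]
vocabulary, rows = tfidf_rows(corpus)

print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first row, `adrotator` receives a higher weight than `adware` because it appears in only one document, which is the "relevant here but rare elsewhere" behavior described above.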

### 3.2  Preparation for Modeling
Once I have my feature matrix X, I need my label vector (y). This vector contains the category of each row ('Adware', 'Backdoor', etc.) encoded as an integer. scikit-learn provides LabelEncoder for this.

In [73]:
from sklearn.preprocessing import LabelEncoder

# Create a label encoder instance
label_encoder = LabelEncoder()

# Transform text labels into numerical values
y = label_encoder.fit_transform(df['label'])

print("Numerical labels (y):", y)
print("Original classes:", label_encoder.classes_)

Numerical labels (y): [1 5 2 3 0 6 4]
Original classes: ['Adware' 'Backdoor' 'Behavior' 'BrowserModifier' 'Constructor' 'Dialer'
 'Microsoft_Defender_All_signatures_list']
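The mapping printed above follows from sorting the distinct labels and using each label's position in that sorted list as its code. The sketch below reproduces this in plain Python for the seven labels in the DataFrame; it mimics the observable behavior and is not scikit-learn's implementation.

```python
# Labels in DataFrame row order, as shown in the earlier output.
labels = [
    "Backdoor",
    "Dialer",
    "Behavior",
    "BrowserModifier",
    "Adware",
    "Microsoft_Defender_All_signatures_list",
    "Constructor",
]

# Sorted distinct labels play the role of label_encoder.classes_.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Encoding: label -> integer code.
y = [class_to_index[name] for name in labels]

# Decoding: integer code -> label (what inverse_transform would return).
decoded = [classes[code] for code in y]

print(y)  # [1, 5, 2, 3, 0, 6, 4]
print(decoded == labels)  # True
```

Every code appears exactly once, which confirms that each class has a single example.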


## 4. Select the ML Model

I chose RandomForestClassifier because it is robust, easy to use, and often produces good results without much tuning. To implement the model, I follow these 3 steps:

### 4.1 Data Splitting
First, split the data into training and testing sets. The training set is used for the model to learn from, while the testing set is used to evaluate its performance on data it has never seen before.

In [74]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set dimensions:", X_train.shape)
print("Testing set dimensions:", X_test.shape)

Training set dimensions: (5, 335844)
Testing set dimensions: (2, 335844)


Notes: train_test_split: This function randomly divides the data. The random_state parameter ensures that the split is the same every time you run the code, guaranteeing reproducibility.
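With one row per class, any split has a consequence worth checking: the classes held out for testing never appear in training. The following sketch illustrates this with the standard `random` module. It does not reproduce the shuffling used by train_test_split, so the specific rows held out here are not the ones from the cell above.

```python
import random

# One label per row, matching the seven rows of the DataFrame.
labels = [
    "Backdoor",
    "Dialer",
    "Behavior",
    "BrowserModifier",
    "Adware",
    "Microsoft_Defender_All_signatures_list",
    "Constructor",
]

# Shuffle row indices with a fixed seed, then hold out two rows for testing.
rng = random.Random(42)
indices = list(range(len(labels)))
rng.shuffle(indices)
n_test = 2
test_idx, train_idx = indices[:n_test], indices[n_test:]

train_labels = {labels[i] for i in train_idx}
test_labels = {labels[i] for i in test_idx}

# Classes the model is asked to predict but never saw during training.
unseen = test_labels - train_labels

print("Test classes:", sorted(test_labels))
print("Unseen during training:", sorted(unseen))
```

Whatever seed is used, `unseen` equals `test_labels`, because each label occurs only once in the data.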

### 4.2 Model Training
Now, train the RandomForestClassifier using the training data.



In [75]:
from sklearn.ensemble import RandomForestClassifier

# Create a RandomForestClassifier model instance
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model with the training data
model.fit(X_train, y_train)

RandomForestClassifier parameters:

Parameter                  Value
n_estimators               100
criterion                  'gini'
max_depth                  None
min_samples_split          2
min_samples_leaf           1
min_weight_fraction_leaf   0.0
max_features               'sqrt'
max_leaf_nodes             None
min_impurity_decrease      0.0
bootstrap                  True


Notes: 
* RandomForestClassifier: This creates an instance of the model. The n_estimators=100 parameter specifies that 100 decision trees will be used in the random forest.

* model.fit(X_train, y_train): This is the training step. The model learns to map the features (X_train) to their labels (y_train).

### 4.3 Model Evaluation
Once the model is trained, evaluate its performance on the testing data.

In [76]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test data
y_pred = model.predict(X_test)

# Get the full list of class names from the label encoder
all_class_names = label_encoder.classes_

# Use the labels parameter to specify the classes present in the test set
# This ensures that the report only includes the classes that exist in y_test
report = classification_report(y_test, y_pred, labels=np.unique(y_test), target_names=all_class_names)

# Calculate the accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)
print("Classification Report:\n", report)


Model Accuracy: 0.0
Classification Report:
                                         precision    recall  f1-score   support

                                Adware       0.00      0.00      0.00       1.0
                              Backdoor       0.00      0.00      0.00       1.0

                             micro avg       0.00      0.00      0.00       2.0
                             macro avg       0.00      0.00      0.00       2.0
                          weighted avg       0.00      0.00      0.00       2.0



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
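One detail in the report above deserves a check. The call passes two labels but all seven class names, and to my understanding classification_report pairs names with labels by position, so the rows named Adware and Backdoor may not be the classes actually present in y_test. The sketch below shows how the names could be selected to match the labels. The `y_test` values in it are hypothetical and chosen only for illustration.

```python
# Sorted class names, as printed by label_encoder.classes_ earlier.
class_names = [
    "Adware",
    "Backdoor",
    "Behavior",
    "BrowserModifier",
    "Constructor",
    "Dialer",
    "Microsoft_Defender_All_signatures_list",
]

# Hypothetical encoded test labels (NOT the real y_test from the notebook).
y_test = [5, 2]

# Distinct labels in the test set, in sorted order (like np.unique(y_test)).
test_labels = sorted(set(y_test))

# Pairing by position takes the first names in the list, whatever the labels are.
positional_names = class_names[:len(test_labels)]

# Indexing by label picks the names that belong to the labels in the test set.
aligned_names = [class_names[label] for label in test_labels]

print("Paired by position:", positional_names)
print("Aligned with labels:", aligned_names)

# The aligned list is what would be passed to the report, for example:
# classification_report(y_test, y_pred, labels=test_labels, target_names=aligned_names)
```

With the aligned names, each row of the report is guaranteed to describe the class it is named after.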


## 5. Final Note

**Justification of why the Machine Learning option is not viable**

Machine Learning is not a viable option for this task with the data as loaded here, because the dataset is far too small. The DataFrame holds only 7 rows, one per input file, and each row carries a unique label. A model needs many examples of each class to learn patterns; here, any class that lands in the test set has no example left in the training set, so the model cannot predict it. The result is zero performance (accuracy, precision, recall, and f1-score are all 0.00).

In this scenario, a direct feature engineering and pattern matching approach is more robust and appropriate. Instead of trying to get a model to learn, you can explicitly extract the indicators of compromise (IOCs) and create precise and functional YARA rules.
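As a rough illustration of that direction, the sketch below turns a signature name and a list of indicator strings into the text of a YARA rule. Everything specific in it is an assumption: the helper name `build_yara_rule`, the `source_signature` metadata field, the `any of them` condition, and above all the two indicator strings, which are invented placeholders and not real IOCs for this family.

```python
import re


def _escape_yara_string(value):
    """Escape backslashes and double quotes for a YARA text string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_yara_rule(signature_name, indicator_strings):
    """Hypothetical helper: render a simple YARA rule as text."""
    if not indicator_strings:
        raise ValueError("A rule needs at least one indicator string.")

    # Rule identifiers may contain only letters, digits, and underscores,
    # and must not start with a digit.
    rule_name = re.sub(r"[^A-Za-z0-9_]", "_", signature_name)
    if rule_name[0].isdigit():
        rule_name = "_" + rule_name

    lines = [
        f"rule {rule_name}",
        "{",
        "    meta:",
        f'        source_signature = "{_escape_yara_string(signature_name)}"',
        "    strings:",
    ]
    for index, indicator in enumerate(indicator_strings):
        escaped = _escape_yara_string(indicator)
        # ascii + wide matches both single-byte and UTF-16 encodings of the text.
        lines.append(f'        $s{index} = "{escaped}" ascii wide nocase')
    lines += [
        "    condition:",
        "        any of them",
        "}",
    ]
    return "\n".join(lines)


# Placeholder indicators for illustration only.
rule_text = build_yara_rule(
    "Worm:Win32/Wootbot",
    ["example-c2.invalid", "C:\\Temp\\dropper.exe"],
)
print(rule_text)
```

A condition of `any of them` is the loosest choice; in practice the condition would be tightened (for example, requiring several strings together) to reduce false positives.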