# Red Teaming AI: Manipulating the Model  

## Overview  
This project explores how machine learning (ML) models react to changes in input data and training data, highlighting vulnerabilities that arise from data manipulation. By demonstrating real-world attack techniques, we showcase the security risks associated with adversarial ML.  

## Vulnerabilities Covered  
- **Input Manipulation (ML01)** – Exploiting model inputs to manipulate predictions.  
- **Data Poisoning (ML02)** – Contaminating training data to degrade model integrity.  

## Dataset & Code  
The dataset and code used in this project are derived from the **Hack The Box AI Red Teaming spam classifier** featured in the *Applications of AI in InfoSec* module.  

## Objective  
By understanding these ML security risks, researchers and security professionals can develop better defenses against adversarial manipulation, ensuring model robustness and reliability in real-world deployments.  

---

# Creating the Target Model from HTB's Code

---

In [9]:
# All Required Imports
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [14]:
# Data Helper
def preprocess_message(message):
    stop_words = set(stopwords.words("english")) - {"free", "win", "cash", "urgent"}
    stemmer = PorterStemmer()

    message = message.lower()
    message = re.sub(r"[^a-z\s$!]", "", message)
    tokens = word_tokenize(message)
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)


def preprocess_dataframe(df):
    df['message'] = df['message'].apply(preprocess_message)
    df = df.drop_duplicates()

    return df
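
To see what the regular expression in `preprocess_message` keeps, the cleaning step can be run on its own. This is a minimal sketch using only the standard library: `clean_text` is a hypothetical helper name, the sample message is made up for illustration, and tokenization, stop-word removal, and stemming are left out.

```python
import re


def clean_text(message):
    """Lowercase the text and keep only a-z, whitespace, '$' and '!'.

    Same regex as preprocess_message; hypothetical helper for illustration.
    """
    return re.sub(r"[^a-z\s$!]", "", message.lower())


cleaned = clean_text("WIN $1000 now!!! Visit https://bit.ly/3YCN7PF")
print(cleaned)  # win $ now!!! visit httpsbitlyycnpf
```

Digits and URL punctuation are dropped, so a link collapses into a single letters-only token before it reaches the vectorizer.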

In [16]:
# Model Helper

# classify one message (str) or an iterable of messages with a trained model
def classify_messages(model, msg_df, return_probabilities=False):
    if isinstance(msg_df, str):
        msg_preprocessed = [preprocess_message(msg_df)]
    else:
        msg_preprocessed = [preprocess_message(msg) for msg in msg_df]

    msg_vectorized = model.named_steps["vectorizer"].transform(msg_preprocessed)

    if return_probabilities:
        return model.named_steps["classifier"].predict_proba(msg_vectorized)

    return model.named_steps["classifier"].predict(msg_vectorized)

In [18]:
# train a model on the given data set
def train(dataset):
    # read training data set
    df = pd.read_csv(dataset)

    # data preprocessing
    df = preprocess_dataframe(df)

    # data preparation (the vectorizer is fitted inside the pipeline below,
    # so no separate fit_transform call is needed here)
    vectorizer = CountVectorizer(min_df=1, max_df=0.9, ngram_range=(1, 2))
    y = df["label"].apply(lambda x: 1 if x == "spam" else 0)

    # training
    pipeline = Pipeline([("vectorizer", vectorizer), ("classifier", MultinomialNB())])
    param_grid = {"classifier__alpha": [0.1, 0.5, 1.0]}
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
    grid_search.fit(df["message"], y)
    best_model = grid_search.best_estimator_

    return best_model

In [20]:
# evaluate a given model on our test dataset
def evaluate(model, dataset):
    # read test data set
    df = pd.read_csv(dataset)

    # prepare labels
    df['label'] = df['label'].apply(lambda x: 1 if x == "spam" else 0)

    # get predictions
    predictions = classify_messages(model, df['message'])

    # compute accuracy
    correct = np.count_nonzero(predictions == df['label'])
    return (correct / len(df))

# Manipulating the Input
The HTB code ships with training and test data in CSV format. Below, we train and evaluate the model as provided to establish a baseline.


---

In [24]:
model = train("./redteam_code/train.csv")
acc = evaluate(model, "./redteam_code/test.csv")
print(f"Model accuracy: {round(acc*100, 2)}%")

Model accuracy: 97.2%


We will adjust the code to print the output probabilities for both classes for a given input message. Running it shows how confident the model is in each class for that message.

In [28]:
model = train("./redteam_code/train.csv")

message = "Hello World! How are you doing?"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Ham
Probabilities:
	 Ham: 98.93%
	Spam: 1.07%
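
The same four reporting lines recur in every cell that follows. As a minimal sketch, they could be factored into one helper; `format_prediction` is a hypothetical name, and it assumes the input is the `[ham, spam]` probability pair returned by `classify_messages(..., return_probabilities=True)[0]`.

```python
def format_prediction(probabilities):
    """Build the report text from a [ham, spam] probability pair.

    Hypothetical helper; ties go to Ham, matching an argmax over the pair.
    """
    ham, spam = probabilities
    label = "Ham" if ham >= spam else "Spam"
    return (
        f"Predicted class: {label}\n"
        "Probabilities:\n"
        f"\t Ham: {round(ham * 100, 2)}%\n"
        f"\tSpam: {round(spam * 100, 2)}%"
    )


# Illustrative probabilities, not output from the trained model
report = format_prediction([0.25, 0.75])
print(report)
```

In the notebook, the call would look like `print(format_prediction(classify_messages(model, message, return_probabilities=True)[0]))`.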


We can repeat this test with a message that looks like spam. In any input manipulation attack, our goal is to trick the model into classifying spam (bad) as ham (good).

In [31]:
model = train("./redteam_code/train.csv")

message = "Congratulations! You won a prize. Click here to claim: https://bit.ly/3YCN7PF"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Spam
Probabilities:
	 Ham: 0.0%
	Spam: 100.0%


#### Technique #1: Rephrasing
Our goal is to get the victim to click our link while avoiding classification as spam. To do that, we need to construct the message carefully so that it passes the model's checks. We can test how the model reacts to individual words by splitting our message into segments and classifying each segment separately. See below for the response to "Congratulations!".

We can repeat this process to build a dictionary of words that receive a low spam probability.
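
The word-by-word probing described above can be sketched as a small helper. This is an illustration under assumptions: `probe_words` and `toy_spam_probability` are hypothetical names, and the toy scorer with its numbers only stands in for the trained model so that the block runs on its own.

```python
def probe_words(message, spam_probability):
    """Score each distinct word of `message` on its own.

    `spam_probability` is any callable mapping text to P(spam).
    Returns (word, score) pairs, lowest spam probability first.
    """
    words = dict.fromkeys(message.split())  # keeps order, drops repeats
    scores = {word: spam_probability(word) for word in words}
    return sorted(scores.items(), key=lambda pair: pair[1])


def toy_spam_probability(text):
    """Stand-in scorer with made-up values, NOT the trained model."""
    made_up = {"congratulations!": 0.65, "prize.": 0.9}
    return made_up.get(text.lower(), 0.1)


ranked = probe_words("Congratulations! You won a prize.", toy_spam_probability)
for word, score in ranked:
    print(f"{word}: {score}")
```

In the notebook, the scorer could be `lambda m: classify_messages(model, m, return_probabilities=True)[0][1]`, which returns the spam probability for a single message.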

---

In [33]:
model = train("./redteam_code/train.csv")

message = "Congratulations!"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Spam
Probabilities:
	 Ham: 35.03%
	Spam: 64.97%


Now we can attempt to circumvent the model's detection by putting together a more carefully worded message. Notice that the model rates this one as 90% ham.

In [44]:
model = train("./redteam_code/train.csv")

message = "Your account has been locked out. You can unlock your account in the next 24h: https://bit.ly/3YCN7PF"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Ham
Probabilities:
	 Ham: 90.08%
	Spam: 9.92%


#### Technique #2: Overpowering
You can also take benign words (those not flagged as spam) and add many of them to tip the scale toward ham. This lowers the message's spam probability relative to its ham probability, and it is one way to hide your intent from an ML model.
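
Why padding works against a naive Bayes classifier can be shown with a toy calculation: each token adds its log-likelihood ratio to the running log-odds, so enough ham-leaning tokens outweigh a few spam-leaning ones. This is a sketch only. The ratios below are made up for illustration and are not taken from the trained model.

```python
import math


def spam_probability(tokens, log_ratio, prior_log_odds=0.0):
    """P(spam | tokens) for a naive Bayes model.

    `log_ratio[t]` is log(P(t | spam) / P(t | ham)); tokens missing from
    the table contribute nothing, like out-of-vocabulary words.
    """
    log_odds = prior_log_odds + sum(log_ratio.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical per-token ratios: positive leans spam, negative leans ham
log_ratio = {"prize": 2.0, "claim": 1.5, "account": -0.5, "system": -0.5}

spam_tokens = ["prize", "claim"]
padding = ["account", "system"] * 5  # ten ham-leaning tokens

before = spam_probability(spam_tokens, log_ratio)
after = spam_probability(spam_tokens + padding, log_ratio)
print(f"before padding: {before:.2f}")  # 0.97
print(f"after padding:  {after:.2f}")   # 0.18
```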

To hide this padding from the user, you could place it inside HTML comments or use other ways of keeping the text in the background while STILL weighting the message toward ham.
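
A minimal sketch of the HTML-comment idea follows. It assumes the message is rendered as HTML on the victim's side, which this project does not test. `with_hidden_padding` and `clean_text` are hypothetical helper names; `clean_text` reuses the regex from `preprocess_message` to show that the padding words still reach the model.

```python
import re


def clean_text(message):
    """Same character filter as preprocess_message (hypothetical helper)."""
    return re.sub(r"[^a-z\s$!]", "", message.lower())


def with_hidden_padding(visible_text, padding_words):
    """Append padding inside an HTML comment, which HTML renderers skip."""
    return f"{visible_text} <!-- {' '.join(padding_words)} -->"


message = with_hidden_padding(
    "Click here to claim", ["account", "system", "teachings"]
)
tokens = clean_text(message).split()
print(message)
print(tokens)
```

The filter strips `<`, `-`, and `>` but keeps letters (and the `!` from the comment opener), so the padding words appear in the token list even though a reader of the rendered message would not see them.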

---

In [49]:
model = train("./redteam_code/train.csv")

message = "Congratulations! You won a prize. Click here to claim: https://bit.ly/3YCN7PF. But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness."
predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Ham
Probabilities:
	 Ham: 100.0%
	Spam: 0.0%


# Manipulating the Training Data
I did not change the size of the dataset; I left it as is. (If you are working through HTB's course, be aware that the course asks for that.)

---

In [56]:
model = train("./redteam_code/train.csv")

message = "Hello World! How are you doing?"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Ham
Probabilities:
	 Ham: 98.93%
	Spam: 1.07%


We can add keywords that we expect to be flagged as spam to see how their weight affects the outcome. As shown below, adding 'spam' barely influences the result.

In [58]:
model = train("./redteam_code/train.csv")

message = "spam,Hello World! How are you doing?"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Predicted class: Ham
Probabilities:
	 Ham: 96.22%
	Spam: 3.78%
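
For reference, one way to manipulate training data without changing the number of rows is label flipping: relabeling a share of the spam rows as ham. The sketch below is an illustration, not the method used in the cells above. `flip_labels` is a hypothetical helper, the in-memory CSV is made up, and the `label`/`message` column names and the `ham` label value are assumptions based on how `train` reads the data.

```python
import csv
import io
import random


def flip_labels(rows, fraction, seed=0):
    """Return a copy of `rows` with `fraction` of the spam rows set to ham.

    The row count stays the same; the input list is not modified.
    """
    rng = random.Random(seed)
    spam_idx = [i for i, row in enumerate(rows) if row["label"] == "spam"]
    to_flip = set(rng.sample(spam_idx, int(len(spam_idx) * fraction)))
    return [
        dict(row, label="ham") if i in to_flip else dict(row)
        for i, row in enumerate(rows)
    ]


# Tiny made-up stand-in for a training CSV
sample_csv = (
    "label,message\n"
    "spam,Win cash now!\n"
    "ham,See you at lunch\n"
    "spam,Free prize inside\n"
    "ham,Call me later\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
poisoned = flip_labels(rows, fraction=0.5)

print(sum(row["label"] == "spam" for row in rows))      # 2
print(sum(row["label"] == "spam" for row in poisoned))  # 1
```

Writing the poisoned rows to a new CSV file and passing that path to `train` would retrain the model on the modified labels.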
