# Build Your Own Gmail Spam Filter - Any LLM You Choose

Are you annoyed with constantly receiving spam or sales emails? You're not alone. Despite Gmail's spam filter 
catching over 50 spam emails daily, we, as startup founders, still found our inboxes flooded with 20+ unwanted 
emails each day.

Last weekend, we decided to write our own personalized spam filter script to keep unwanted emails out of our 
inbox. Some online posts show how to use GPT to filter email, but those approaches are hard to configure and 
only support OpenAI. Hence, we decided to share our solution, which is:
- **Quick to set up** (under an hour),
- **Cost-effective** (less than $1 per month),
- **Compatible with various LLM models** (including OpenAI, Google Gemini, Anthropic, Mistral-7B, LLaMA, and more).

In this example, we will show you how to choose an LLM and build your own Gmail spam filter with 
[Uniflow](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering), 
an open-source, LLM-agnostic library for data cleaning.


<div style="border-left: 4px solid lightblue; background-color: lightblue; color: white; padding: 10px 20px; margin: 20px 0;">

### Before running the code

1. [Estimated: 5 min] Set up the `uniflow` conda environment to run this notebook by following these [instructions](https://github.com/CambioML/uniflow/tree/main#installation).

2. [Estimated: 5 min] Get a valid API key to run the code (either OpenAI or Google Gemini):
    - [Get Google API key](https://ai.google.dev/tutorials/setup). 
    - [Get OpenAI API key](https://platform.openai.com/api-keys). 
 
    Once you have the key, set it as an environment variable (either `OPENAI_API_KEY` or `GOOGLE_API_KEY`) in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).

</div>

### Import the libraries and update system path

Install the dependencies in your environment.

In [1]:
!pip install -q tiktoken uniflow==0.0.27
!pip install -q --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

In [2]:
from uniflow import Context
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import ExtractGmailConfig, TransformGmailSpamConfig
from uniflow.op.model.model_config import GoogleModelConfig, OpenAIModelConfig

from dotenv import load_dotenv
load_dotenv()




True

### Initialize an `ExtractClient` with `ExtractGmailConfig`

There are two steps to building a Gmail spam filter: extracting the email content and then classifying whether it is spam. Let's start with the extract step.

<div style="border-left: 4px solid lightblue; background-color: lightblue; color: white; padding: 10px 20px; margin: 20px 0;">

### Set credentials

[Estimated: 10 min] Set up and download `credentials.json` by following the Google Workspace [instructions](https://developers.google.com/gmail/api/quickstart/python). 

</div>


With the credentials, you can initialize an `ExtractClient` to extract your email text.

In [None]:
extract_client = ExtractClient(
    ExtractGmailConfig(
        credentials_path="credentials.json",
        token_path="token.json",
        )
    )

After you run the cell above, a URL will pop up asking you to authorize your spam filter to access your Gmail account. 

Click `Allow`:

<p align="center">
  <img src="./gmail-spam-filter-login.png" alt="Gmail authorization prompt for the spam filter" width="50%" />
</p>


Now use the `extract_client` to extract the body and snippet of your latest unread emails (maximum 7 emails). Estimated running time: 10-20 seconds.

In [4]:
extract_data = extract_client.run([{}])

100%|██████████| 1/1 [00:05<00:00,  5.34s/it]
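
Judging from how the later cells index the result, `extract_data[0]['output'][0]` appears to be a list with one dict per email, carrying `email_id`, `body`, and `snippet` keys. The sketch below uses made-up sample data in that assumed shape to show one quick way to inspect what was extracted before classifying it. The `summarize_emails` helper and the sample values are hypothetical and not part of Uniflow.

```python
def summarize_emails(extract_data, preview_chars=60):
    """Return (email_id, preview) pairs for a quick look at extracted emails.

    Assumes the structure used elsewhere in this notebook:
    extract_data[0]["output"][0] is a list of dicts with the keys
    "email_id", "body", and "snippet".
    """
    emails = extract_data[0]["output"][0]
    summary = []
    for email in emails:
        # Prefer the body; fall back to the snippet when the body is empty.
        text = email.get("body") or email.get("snippet") or ""
        if isinstance(text, bytes):
            text = text.decode("utf-8", errors="replace")
        # Collapse runs of whitespace so the preview fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        summary.append((email["email_id"], preview))
    return summary


# Made-up sample in the assumed shape (not real extractor output).
sample_extract_data = [
    {"output": [[
        {"email_id": "id-1", "body": b"Hello\r\nworld", "snippet": "Hello"},
        {"email_id": "id-2", "body": "", "snippet": "Limited time offer"},
    ]]}
]

for email_id, preview in summarize_emails(sample_extract_data):
    print(email_id, "->", preview)
```

On real data, you would pass `extract_data` instead of `sample_extract_data`.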


### Initialize a `TransformClient` with `TransformGmailSpamConfig`

Now that you have retrieved the email text with the `ExtractClient`, we define a `TransformClient` to classify whether a given email is spam. The default `TransformGmailSpamConfig` contains the instruction and few-shot prompts for the spam classification task.

In [5]:
# Comment and uncomment to try both openai and google models
transform_client = TransformClient(
    TransformGmailSpamConfig(
        flow_name="TransformOpenAIFlow",
        model_config=OpenAIModelConfig(),
        # flow_name="TransformGoogleFlow",
        # model_config=GoogleModelConfig()
        )
    )
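
Instead of commenting lines in and out, the flow name could also be picked from whichever API key is configured. The sketch below is a hypothetical helper (`pick_flow_name` is not part of Uniflow); it only reuses the two flow names and the two environment variable names that appear in this notebook, and it assumes OpenAI should win when both keys are present.

```python
import os


def pick_flow_name(env=None):
    """Pick a flow name based on which API key is present.

    `env` defaults to os.environ; passing a dict makes the helper easy to try out.
    Assumption: OpenAI takes priority when both keys are set.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "TransformOpenAIFlow"
    if env.get("GOOGLE_API_KEY"):
        return "TransformGoogleFlow"
    raise RuntimeError("Set OPENAI_API_KEY or GOOGLE_API_KEY in your .env file.")


print(pick_flow_name({"GOOGLE_API_KEY": "dummy-key"}))
```

The returned string would then be passed as `flow_name`, together with the matching `OpenAIModelConfig()` or `GoogleModelConfig()`.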


The `transform_client` takes the result from `extract_client` and transforms it into output that contains a classification label. Let's print the default `TransformGmailSpamConfig` to see the default prompts and model config.

In [6]:
from pprint import pprint
pprint(transform_client._config)


TransformGmailSpamConfig(flow_name='TransformOpenAIFlow',
                         model_config=OpenAIModelConfig(model_name='gpt-3.5-turbo-1106',
                                                        model_server='OpenAIModelServer',
                                                        num_call=1,
                                                        temperature=0.9,
                                                        response_format={'type': 'text'},
                                                        num_thread=1,
                                                        batch_size=1),
                         num_thread=1,
                         prompt_template=PromptTemplate(instruction='You are a highly intelligent AI trained to identify spam emails. Is this email a spam email?. Follow the format of the few shot examples below to include explain and answer in the response for the given email. You answer should be either Yes or No.', few_shot_prompt=[Context(email="

Rather than sending the full email body to the `transform_client`, we use the body when it is present (falling back to the snippet otherwise) and keep only the first 5000 characters to avoid timeouts.

Let's take a look at the text extracted from the first email:

In [7]:
data_to_transform = []
for d in extract_data[0]['output'][0]:
    if d['body']:
        data_to_transform.append(Context(email=d['body'][:5000]))
    else:
        data_to_transform.append(Context(email=d['snippet'][:5000]))
data_to_transform[0]

Context(email=b'View this post on the web at https://www.datagravity.dev/p/150b-of-annual-capex-the-trends-in\r\n\r\nMeta recently announced acquiring over 600K GPUs in 2024 [ https://substack.com/redirect/e7a530a0-c07c-44cd-b92d-a9aa6a7caa89?j=eyJ1IjoiMzBwdXhjIn0.D3udPa-oMTfcKnoniYWOWKKkcZ3T4hfaoUhanbZZxGs ] with plans for $37B of total capital expenditure investment. The announcement triggered a wave of analysis speculating which hardware vendors would benefit from this data center investment. \r\nToday, we look at the CAPEX of all 6 of the trillion dollar tech companies \xe2\x80\x94 which are Apple, Microsoft, Nvidia, Amazon, Google and Meta. The majority of this total CAPEX involves data centers, for a mix of internal and customer usage \xe2\x80\x94 though there are big strategic differences by company which reflect unique business models and goals. We calculated these metrics by downloading the cash flow statements of each company and compiling the CAPEX data into one sheet (below
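
The output above shows that the body arrives as `bytes`, so `[:5000]` keeps the first 5000 bytes rather than characters and could cut a multi-byte UTF-8 character in half. A minimal sketch of decoding before truncating follows; the `to_text` helper is hypothetical, and it assumes the bodies are UTF-8 encoded, which the `\xe2\x80\x94` sequences in the output suggest but do not guarantee.

```python
def to_text(value, max_chars=5000):
    """Decode bytes to str (if needed) and keep at most max_chars characters."""
    if isinstance(value, bytes):
        # errors="replace" substitutes a placeholder for malformed bytes
        # instead of raising UnicodeDecodeError.
        value = value.decode("utf-8", errors="replace")
    return value[:max_chars]


raw = b"CAPEX \xe2\x80\x94 data centers"
print(to_text(raw))               # full text, decoded
print(to_text(raw, max_chars=7))  # truncated by characters, not bytes
```

Whether `Context` handles decoded strings and raw bytes identically downstream is not something this notebook shows, so treat the swap as an experiment rather than a drop-in replacement.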

Now it's time to run the `transform_client` to classify whether each email is spam.

In [8]:
transform_output = transform_client.run(data_to_transform)

100%|██████████| 7/7 [00:06<00:00,  1.02it/s]
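
The labeling step later in this notebook decides spam versus not spam by checking whether the response text contains `yes`, which also matches words such as "yesterday" inside the explanation. The sketch below is one stricter alternative. It assumes the response contains an `answer: Yes` or `answer: No` field, which the instruction text suggests; the exact format comes from the few-shot examples, which are truncated in the printout above, so the `parse_is_spam` helper and its pattern are assumptions to adapt.

```python
import re


def parse_is_spam(response):
    """Return True for a Yes answer, False for No, and None if unclear.

    Assumption: the model follows an "answer: Yes/No" format.
    """
    # Look for an explicit answer field; if several appear, use the last one.
    matches = re.findall(r"answer\W*\b(yes|no)\b", response, flags=re.IGNORECASE)
    if matches:
        return matches[-1].lower() == "yes"
    # Fallback: accept a response that is nothing but a bare Yes or No.
    bare = response.strip().strip(".!\"'").lower()
    if bare in ("yes", "no"):
        return bare == "yes"
    return None


examples = [
    "explanation: It advertises a product.\nanswer: Yes",
    "explanation: Yesterday's newsletter the user subscribed to.\nanswer: No",
    "No.",
    "I cannot tell.",
]
for text in examples:
    print(repr(text), "->", parse_is_spam(text))
```

Returning `None` for unclear responses lets the caller decide what to do, for example leaving the email unlabeled instead of guessing.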


### Update corresponding email with label

Finally, we use the Gmail API to label each email as spam or not spam based on the classifier's output. The cell below loads the OAuth 2.0 credentials saved in `token.json` with the `gmail.modify` scope, which grants permission to change the labels on messages in the user's mailbox.

The loop then pairs each extracted email with its classification result, picks the matching label, and applies it with the Gmail API's `messages.modify` method.

In [12]:
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

SPAM_LABEL = "Spam Email (AI Email Filter)"
NON_SPAM_LABEL = "Email (AI Email Filter)"

SCOPES = ["https://www.googleapis.com/auth/gmail.modify"]
creds = Credentials.from_authorized_user_file("token.json", SCOPES)
service = build("gmail", "v1", credentials=creds)


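def get_label_id(service, label_name):
    """Return the ID of the Gmail label named label_name, or None if it does not exist."""
    labels = service.users().labels().list(userId='me').execute().get('labels', [])
    for label in labels:
        if label['name'] == label_name:
            return label['id']
    return None

# Both labels must already exist in your Gmail account;
# get_label_id returns None for a label that is missing.
SPAM_LABEL_ID = get_label_id(service, SPAM_LABEL)
NON_SPAM_LABEL_ID = get_label_id(service, NON_SPAM_LABEL)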

for e, t in zip(extract_data[0]['output'][0], transform_output):
    email_id = e['email_id']
    # Treat the email as spam if the model's response contains "yes" (case-insensitive)
    is_spam = "yes" in t['output'][0]['response'][0].lower()
    print(f"Email {email_id} is spam: {is_spam}")
    label_id = SPAM_LABEL_ID if is_spam else NON_SPAM_LABEL_ID
    # Add the chosen label to the message; no labels are removed
    service.users().messages().modify(
        userId='me',
        id=email_id,
        body={'addLabelIds': [label_id], 'removeLabelIds': []},
    ).execute()

Email 18e199f305038665 is spam: False
Email 18e19935d9217e9d is spam: True
Email 18e197cf17c6bcde is spam: False
Email 18e196121dc03ef4 is spam: True
Email 18e1937489ff5f21 is spam: True
Email 18e18fb5375cfc71 is spam: False
Email 18e163b01309eb98 is spam: False
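
The cell above looks labels up by name and never creates them, so it relies on both labels already existing in the account. A minimal sketch of a get-or-create variant follows. `get_or_create_label_id` is a hypothetical helper; it uses the Gmail API's `users.labels.create` method, and the two visibility settings are an assumption about how you want the label displayed.

```python
def get_or_create_label_id(service, label_name):
    """Return the ID of label_name, creating the label if it does not exist.

    `service` is a Gmail API client such as the one built with
    build("gmail", "v1", credentials=creds).
    """
    labels = service.users().labels().list(userId="me").execute().get("labels", [])
    for label in labels:
        if label["name"] == label_name:
            return label["id"]
    # Label not found: create it as a user label that is visible
    # in both the label list and the message list.
    created = service.users().labels().create(
        userId="me",
        body={
            "name": label_name,
            "labelListVisibility": "labelShow",
            "messageListVisibility": "show",
        },
    ).execute()
    return created["id"]
```

With this helper, the two label IDs could be resolved with `get_or_create_label_id(service, SPAM_LABEL)` and `get_or_create_label_id(service, NON_SPAM_LABEL)` before the loop runs.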


Your emails have now been auto-classified. Go to Gmail and search for `label:spam-email--ai-email-filter-`:

<p align="center">
  <img src="./gmail-spam-filter-final-results.png" alt="Gmail search results showing emails labeled by the AI email filter" width="100%" />
</p>

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>