# Email Importance Recognition

I get a lot of email that is irrelevant to me, and no matter how many lists I unsubscribe from, more always arrives. While clearing out all of this junk, I often find that something important gets lost in the mix. So I built a process that uses Google's Gemini 2.0 Flash model to predict how important an email is to a given user.

## Dataset

Ideally, we would use an API call to automatically retrieve emails from a service. However, services like Microsoft Outlook or Gmail require verified OAuth2 access, which is out of scope for this project. Unfortunately, a ready-made list of example emails was out of the question as well, since such lists do not come with relevance labels for a given user. So in the end, I tested on a manually assembled list of 12 emails taken straight from my own inbox.

Additionally, I have included an example 'User Preferences' string that resembles something I might have used for myself.

### Loading the data:

In [2]:
import json

# The file holds the list of emails plus the example user preferences string
with open("sample_emails.json", "r") as f:
    data = json.load(f)
emails, user_pref = data["Emails"], data["UserPreferences"]
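For readers without access to the file, here is a rough sketch of the layout the loading code expects. The key names (`Emails`, `UserPreferences`, `from`, `subj`, `body`, `real_rating`) are the ones used elsewhere in this notebook; the values, the `validate` helper, and `REQUIRED_KEYS` are made up for illustration and are not part of the original project.

```python
import json

# Hypothetical stand-in for sample_emails.json; every value here is invented.
sample = {
    "UserPreferences": "I am a student. Course and billing emails matter most to me.",
    "Emails": [
        {
            "from": "Example Sender <sender@example.com>",
            "subj": "Example subject line",
            "body": "Example body text.",
            "real_rating": 3,
        }
    ],
}

REQUIRED_KEYS = {"from", "subj", "body", "real_rating"}


def validate(data):
    """Return a list of problems found in the loaded data (empty if none)."""
    problems = []
    for key in ("Emails", "UserPreferences"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for i, email in enumerate(data.get("Emails", [])):
        missing = REQUIRED_KEYS - email.keys()
        if missing:
            problems.append(f"email {i} is missing: {sorted(missing)}")
        elif email["real_rating"] not in (1, 2, 3, 4, 5):
            problems.append(f"email {i} has an out-of-range real_rating")
    return problems


# Round-trip through JSON text to mimic reading the file from disk.
loaded = json.loads(json.dumps(sample))
print(validate(loaded))
```

Running a check like this before spending any tokens catches a malformed entry early, which matters when every model call counts against a quota.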

## Assessing the Importance

In order to categorize these emails, we first need a pre-trained model to work with (12 emails are nowhere near enough for me to train a model of my own with complex, high-order reasoning skills and a solid grasp of language).

In [1]:
from google import genai

# Read the API key from a local file (replace the path with the location of your own key)
with open("C:\\Users\\peter\\OneDrive\\Documents\\GeminiAPI.txt", "r") as f:
    key = f.read().strip()
client = genai.Client(api_key=key)

## The Measure

This whole time I have been mentioning 'rating importance', but how can this actually be accomplished?
There are 2 main ways to approach this:
- Relative ordering
- Giving a direct score

Relative ordering involves giving the model 2 emails at once and having it choose which is more important. This likely gives better results, as the model will be better at comparing two emails than at assigning a somewhat arbitrary score, but it has its downsides. Namely, it often involves feeding the same email to the model multiple times, which means parsing more tokens, and tokens are not something we have unlimited access to with our free model. It is also a more complex method.\
\
For these reasons, I went with a more direct approach: simply asking the model for a score given an email and the user preferences. While this method is cheap and easy, it can lack nuance. Trying to define the scale for the scores can be challenging, especially when trying to phrase them in a way the model will truly grasp and understand.\
\
In the end, I decided on a scale of 1 to 5 (with 5 being the most important), and instructed the model to only use '1' and '5' in extreme cases.
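To make the cost of relative ordering concrete, here is a minimal sketch of how it could look. It is not what this notebook does. `compare_importance` is a hypothetical stand-in for a model call that receives two emails, and the `urgency` field is invented so the stub has something to compare.

```python
from functools import cmp_to_key

comparisons = 0  # counts how many pairwise questions would be sent to the model


def compare_importance(email_a, email_b):
    """Hypothetical stand-in for a model call that picks the more important email.

    Returns a negative number if email_a is more important, a positive number
    if email_b is, and 0 for a tie. A made-up 'urgency' field replaces the
    model's judgement here.
    """
    global comparisons
    comparisons += 1
    return email_b["urgency"] - email_a["urgency"]


# Made-up emails; 'urgency' exists only so the stub can run offline.
inbox = [
    {"subj": "Weekly newsletter", "urgency": 1},
    {"subj": "Invoice overdue", "urgency": 9},
    {"subj": "Team lunch on Friday", "urgency": 4},
    {"subj": "Password reset requested", "urgency": 7},
]

# Sort so the most important email comes first.
ranked = sorted(inbox, key=cmp_to_key(compare_importance))
print([email["subj"] for email in ranked])
print(f"pairwise comparisons used: {comparisons}")
```

Each comparison would be one model call containing two full emails, so the same email is sent several times over the course of a sort. That is the token cost described above, and the reason the rest of this notebook uses a direct score instead.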

In [3]:
def compute_rating(email, UserPreference=""):
    # Build the prompt: instructions first, then user details, then the email itself
    prompt = f"""
        Please give an integer rating of 1, 2, 3, 4, or 5 representing how important the following email would be to the user (5 is the most important). Do not output anything else, no explanation is needed.
        Try to reserve 1 for truly useless emails, and 5 for very high priority emails. Emails that could potentially be useful or important to the user should on average receive a 3.
        The user may also provide some details about them that would influence what is or is not important to them.
        Details about the user: {UserPreference}
        
        Email:
        From: {email['from']}
        Subject: {email['subj']}
        
        {email['body']}"""
    
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    
    # The model is asked for a single digit; strip whitespace before reading it
    text = (response.text or "").strip()
    try:
        res = int(text[:1])
    except ValueError:
        res = 0  # 0 marks a response that could not be parsed
        print(f"Error parsing response. Expected integer, got: {text!r}")
    
    return res

And now, to run the test for 1 example:

In [None]:
print(f"Email:\nFrom: {emails[0]['from']}\nSubject: {emails[0]['subj']}\n\n{emails[0]['body'][:80]}...")
print(f"\n\nModel Rating: {compute_rating(emails[0], user_pref)}/5")

Email:
From: OneDrive <no-reply@onedrive.com>
Subject: Many files were recently deleted from your OneDrive

Files are permanently removed from your OneDrive recycle bin 30 days after they ...
Model Rating: 3/5


## Scoring the results

Great! Now we have a way to give emails a level of importance based on a given user. But how can we even tell if it is doing a good job? For this, we need to figure out some sort of metric.

### The metric

For this metric, I decided to simply check how close the model's rating was to the correct one. To produce the correct ratings, we can do the model's job for it once, giving every email the importance level we think appropriate. Once this is done, we can run the model on every email and see how close it came to the real answer. Then we can take the Mean Absolute Error (or the Total Absolute Error) and use it to tune any aspect of the model or the input we provide to it.\
\
(Using threading for parallel operation)

In [7]:
import threading
import time

results = {}
res_sem = threading.Semaphore()




def compute_rating_parallel(email, i, UserPreference=""):
    # Reuse compute_rating (defined earlier) so the prompt and parsing live in one place
    res = compute_rating(email, UserPreference)
    
    # Store the result under the lock so concurrent writes stay safe
    with res_sem:
        results[i] = res





def compute_rating_for_all(emails, user_pref=""):
    threads = []
    
    for i, email in enumerate(emails):
        # dispatch a thread to compute the rating for the current email
        time.sleep(0.1)  # sleep for a bit to avoid overwhelming the server
        thread = threading.Thread(target=compute_rating_parallel, args=(email, i, user_pref))
        thread.daemon = True
        thread.start()
        threads.append(thread)
    
    # yield results in input order, waiting for each thread to finish
    for i, thread in enumerate(threads):
        thread.join()
        # read under the lock, but release it before yielding so the lock
        # is never held while the caller's code runs
        with res_sem:
            rating = results[i]
        yield rating
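As an aside, the same fan-out can be sketched with the standard library's `concurrent.futures`, which removes the need for a shared dictionary and a lock. This is an alternative, not the approach used in this notebook. `fake_rating` is a hypothetical stand-in for `compute_rating` so the example runs without an API key, and `rate_all` and `max_workers=4` are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_rating(email, user_pref=""):
    """Hypothetical stand-in for compute_rating so the sketch runs offline."""
    time.sleep(0.01)  # pretend to wait on the network
    return len(email["subj"]) % 5 + 1


def rate_all(emails, rate_fn, user_pref="", max_workers=4):
    """Rate every email concurrently and return the ratings in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of its inputs, not completion order
        return list(pool.map(lambda e: rate_fn(e, user_pref), emails))


demo_emails = [{"subj": "a"}, {"subj": "bb"}, {"subj": "ccc"}]
print(rate_all(demo_emails, fake_rating))
```

Capping `max_workers` also limits how many requests are in flight at once, which plays a similar role to the short `time.sleep` in the threaded version. The rest of this notebook keeps using `compute_rating_for_all` as defined above.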

And run it:

In [10]:
total_diff = 0
for i, rating in enumerate(compute_rating_for_all(emails, user_pref)):
    print(f"Rating {i+1}/{len(emails)}: {rating} (real: {emails[i]['real_rating']})")
    total_diff += abs(rating - emails[i]["real_rating"])
print(f"Average difference: {total_diff/len(emails)}")
print(f"Total difference: {total_diff}")
print("Done")

Rating 1/12: 3 (real: 3)
Rating 2/12: 1 (real: 1)
Rating 3/12: 1 (real: 2)
Rating 4/12: 1 (real: 1)
Rating 5/12: 1 (real: 1)
Rating 6/12: 1 (real: 1)
Rating 7/12: 2 (real: 1)
Rating 8/12: 2 (real: 1)
Rating 9/12: 4 (real: 4)
Rating 10/12: 5 (real: 5)
Rating 11/12: 1 (real: 3)
Rating 12/12: 4 (real: 4)
Average difference: 0.4166666666666667
Total difference: 5
Done
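The two numbers at the bottom of that output can be reproduced without calling the model. The sketch below copies the predicted and real ratings from the run above into plain lists; the helper name `total_and_mean_abs_error` is my own and not part of the notebook.

```python
def total_and_mean_abs_error(predicted, actual):
    """Return (total absolute error, mean absolute error) for two rating lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total, total / len(actual)


# Ratings copied from the run above, in email order
predicted = [3, 1, 1, 1, 1, 1, 2, 2, 4, 5, 1, 4]
actual = [3, 1, 2, 1, 1, 1, 1, 1, 4, 5, 3, 4]

total, mean = total_and_mean_abs_error(predicted, actual)
print(f"Total difference: {total}")
print(f"Average difference: {mean}")
```

Keeping the metric in its own function makes it easy to rescore a saved run after changing the hand labels, without spending any more tokens.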


## Refining the prompt

I mostly used this difference to tune the prompt, but it can just as easily measure the impact of any other change to this process. I tried a lot of things, but found that my original prompt was almost spot-on. A few of the things I tried:
- Reordering the prompt
    - The order of the instructions, user preference, email metadata, and email body can be rearranged to yield different qualities of results. In the end though, I found that the obvious ordering of Explanation, User preference, Email metadata, then Email body was the best choice for accurate answers.
- Rephrasing the instructions
    - This was one of the main things I changed from the original prompt. Originally, the instructions were brief. The prompt simply stated that the model needed to output an integer from 1 to 5 to rank the importance of the following email, and that 5 was the most important. As it turns out, this gave the model a tendency to be overly fond of either a really high or really low rating, and to rarely give any other rating. I ended up needing to repeat this information and emphasize it in order to get the model to fully utilize the 1-5 scale.
- Removing elements
    - While most of the items included in the prompt seem necessary, the body of the email is not only one of the most token-heavy items, but is also potentially redundant, as it is summarized by the "subject" field by nature. However, my attempts to remove it to save on token usage had a drastic negative impact on accuracy, and after some other tweaks to confirm this wasn't a fluke, the body was re-added.
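
Experiments like the ones listed above boil down to scoring several prompt variants against the same hand labels. Here is a minimal sketch of such a harness. The two templates and the `stub_rate` function are invented for illustration; in a real run `stub_rate` would be replaced by a call to the model, and the numbers this sketch prints say nothing about how Gemini behaves.

```python
def total_difference(emails, rate_fn, template):
    """Sum of absolute differences between ratings and hand labels for one template."""
    return sum(abs(rate_fn(template, e) - e["real_rating"]) for e in emails)


# Hypothetical prompt variants; the wording is for illustration only.
templates = {
    "with_body": "Rate 1-5.\nSubject: {subj}\n\n{body}",
    "subject_only": "Rate 1-5.\nSubject: {subj}",
}


def stub_rate(template, email):
    """Made-up stand-in for the model: rates 5 if 'urgent' appears in the filled prompt."""
    prompt = template.format(**email)  # unused keys such as real_rating are ignored
    return 5 if "urgent" in prompt.lower() else 1


demo = [
    {"subj": "Account notice", "body": "Urgent: payment failed.", "real_rating": 5},
    {"subj": "Spring sale", "body": "Save 20% today.", "real_rating": 1},
]

scores = {name: total_difference(demo, stub_rate, t) for name, t in templates.items()}
print(scores)
```

Scoring every variant with the same function and the same labels keeps the comparison fair, and the variant with the lowest total is the one to keep.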


Overall, the changes I made brought the model's Total Difference for this dataset down from 10 to 5, which is quite the improvement, considering that picking ratings uniformly at random from 1 to 5 would average a Total Difference of about 20.6 against these labels.
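
One way to pin down a random baseline is to compute it exactly from the hand labels. The sketch below assumes a rater that picks an integer from 1 to 5 uniformly at random and independently for each email; a different assumption about the random rater would give a different figure. The labels are copied from the run shown earlier, and the helper name `expected_abs_error` is my own.

```python
from fractions import Fraction

# Hand labels copied from the run shown earlier, in email order
actual = [3, 1, 2, 1, 1, 1, 1, 1, 4, 5, 3, 4]
scale = range(1, 6)  # the possible ratings: 1, 2, 3, 4, 5


def expected_abs_error(label):
    """Expected |guess - label| when the guess is uniform over the scale."""
    return Fraction(sum(abs(guess - label) for guess in scale), len(scale))


# Expectations add up, so the expected total is the sum over all emails
expected_total = sum(expected_abs_error(label) for label in actual)
print(float(expected_total))
```

Because half of the labels are 1, a uniform guesser is penalized heavily on this dataset: a label of 1 or 5 costs it 2 points on average, while a label of 3 costs only 1.2.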

## Summary

Overall, I would say that this concept has some serious potential. Personally, I know it could improve my productivity with email by a significant factor. The model works rather well, and I believe that further tweaks to the prompt and a switch to a more powerful model could both improve its accuracy even further. However, this implementation still has a ways to go.

## Limitations and future prospects

As mentioned, one of the biggest limiting factors for most of the decisions I made in this project was the ever-looming token limit. I was constantly butting up against it during testing, and could only run 1-2 tests every few minutes even with all of the token optimizations in place. And since I do not want to compromise model quality, I believe the best solution would be to upgrade to a paid model (I have been pointed to gpt-4 as an affordable model for this scale). This would allow me to try many more things, such as the relative ordering mentioned back near the start of this notebook.\
\
As far as future prospects go, I have a lot of plans. I am currently running a small Tkinter app in Python as a GUI for this application, but it is still small-scale and ugly, so an upgrade is definitely necessary. Additionally, getting OAuth2 to work so I can retrieve emails straight from Gmail would be ideal, and would make this a more complete application that I could more easily put on my portfolio or resume. Furthermore, to do all of this I would need to pay for a database and a web service for OAuth2, which could also be used to better manage user preferences and to make the API calls to the model from the server instead of the client.\
\
So there are a lot of things still to do for this project, and it still feels to be in its infancy, but the road forward appears clear, and the future looks bright.