# Sampling Gmail Data

Although we have around 70k mails, we will need 1000-1500 well-labeled mails for the training purposes.
> **Label Quality matters far more than Volume.** 

## Objective

The sampling cannot be done randomly because we'll get :

- Mostly similar promotional and social media emails (more than 80-90% of the mails are promotional and social)
- Very few truly important emails (quantity is less)
- Almost no obvious spam (the quantity is already less than 100)

That's **bad training data** because the ML Model 

- Barely sees important emails
- Almost no spam examples
- Classifier becomes biased

> Random Sampling reflects reality but fails to teach the model how to to make predictions.



We'll adopt an intelligent Sampling Strategy, focusing on diversity across :

- Gmail label
- Sender Domain
- Email intent
- Time

The goal is to

> **Expose the model to different “types” of emails you receive.**

## Stratified Sampling

In our case, since 
- the gmail spam folder is tiny
- promotions and social media mails dominate
- important mails are fewer but critical

We'll adopt a strategy called Stratified Sampling.

> **Stratified sampling is a way of sampling data so that important subgroups (strata) are properly represented in your sample.**


1. Split the data into meaningful groups
2. Take random samples from each group

Instead of picking points purely at random, you **control the composition of the sample.**


## Exploring the data

In [2]:
#Loading the emails data into a Pandas dataframe
import pandas as pd

df = pd.read_csv("../data/raw_emails.csv")
print("Total emails:", len(df))
df.head()


Total emails: 69672


Unnamed: 0,message_id,subject,snippet,sender,sender_domain,internal_date,labels,has_attachment
0,19b27b8215ee9601,Give the Gift of Smart Tech | ESR,Get an extra 20% off sitewide with code 25XMAS...,ESRTech <newsletter@esrtech.com>,esrtech.com,1765900000000.0,"['CATEGORY_PROMOTIONS', 'UNREAD', 'INBOX']",False
1,19b271387099e3e7,The strength of working women in Afghanistan,Plus: A tumultuous history of punctuation ͏ ͏ ...,Aeon Daily <support@aeon.co>,aeon.co,1765890000000.0,"['UNREAD', 'CATEGORY_UPDATES', 'INBOX']",False
2,19b269adbb77da49,"Dear Customer, Important Update for K7 Antivir...",Don&#39;t Wait Till It&#39;s Too Late HACKERS ...,K7 Antivirus <alerts@k7computing.com>,k7computing.com,1765880000000.0,"['UNREAD', 'CATEGORY_UPDATES', 'INBOX']",False
3,19b267b7d35b4aa3,Last payment attempt unsuccessful for Jio Numb...,"Dear Customer, Your last payment attempt of Rs...",Notification@jio.com,jio.com,1765880000000.0,"['UNREAD', 'CATEGORY_UPDATES', 'INBOX']",False
4,19b266e80aebd978,Last payment attempt unsuccessful for Jio Numb...,"Dear Customer, Your last payment attempt of Rs...",Notification@jio.com,jio.com,1765880000000.0,"['UNREAD', 'CATEGORY_UPDATES', 'INBOX']",False


In [2]:
df["labels"].value_counts()

labels
['UNREAD', 'CATEGORY_SOCIAL', 'INBOX']                                     33697
['UNREAD', 'CATEGORY_UPDATES', 'INBOX']                                    19897
['CATEGORY_PROMOTIONS', 'UNREAD', 'INBOX']                                 11974
['CATEGORY_UPDATES', 'INBOX']                                               1329
['IMPORTANT', 'CATEGORY_UPDATES', 'INBOX']                                   628
['UNREAD', 'CATEGORY_PERSONAL', 'INBOX']                                     488
['UNREAD', 'IMPORTANT', 'CATEGORY_UPDATES', 'INBOX']                         478
['IMPORTANT', 'CATEGORY_PERSONAL', 'INBOX']                                  473
['UNREAD', 'IMPORTANT', 'CATEGORY_PERSONAL', 'INBOX']                        148
['CATEGORY_PERSONAL', 'INBOX']                                               127
['SENT']                                                                     116
['CATEGORY_PROMOTIONS', 'INBOX']                                              68
['CATEGORY_PROMOTIONS

In [None]:
## These are the different categories and labels. Almost 95% of my emails are unread. Clearly,I don't really read my emails.
## And almost 50% of them are social media updates.We have apps for social media updates, why are they sending me mails.

In [3]:
df["has_attachment"].value_counts()

has_attachment
False    69672
Name: count, dtype: int64

In [None]:
## None of the emails have any attachments.

In [4]:
df["sender_domain"].value_counts()

sender_domain
facebookmail.com       29677
linkedin.com            5203
quora.com               2159
jobalertshub.com        1237
amazon.in               1204
                       ...  
ndl.gov.in                 1
nw18.com                   1
email.bullguard.com        1
hubpages.com               1
craftsvilla.com            1
Name: count, Length: 794, dtype: int64

In [None]:
## Facebook sends me a lot of emails. I am not even active on it. Maybe that's why.

## Understanding Gmail Labels

In Gmail, everything is a label.

- Inbox, Spam, Promotions → all are labels
- Emails can have multiple labels at once
- Categories are just special system labels

So technically:

>“Categories are labels with special meaning.”

#### Core system labels (used internally by Gmail)

These are always present and very reliable.

| Label       | Meaning                        |
| ----------- | ------------------------------ |
| `INBOX`     | Appears in inbox               |
| `SPAM`      | Classified as spam             |
| `TRASH`     | Deleted                        |
| `UNREAD`    | Not opened                     |
| `STARRED`   | Starred by user                |
| `IMPORTANT` | Gmail thinks it matters to you |
| `SENT`      | Sent by you                    |
| `DRAFT`     | Draft email                    |

#### Category labels (the tabs you see in Gmail)

These are content-based classifications.

| Category   | Label ID              | Meaning | Relevance
| ---------- | --------------------- |----------| ----------|
| Primary    | `CATEGORY_PERSONAL`   |Person-to-person emails, Bank alerts, OTPs, Work / important communication| *High likelihood you’ll open it*|
| Promotions | `CATEGORY_PROMOTIONS` |Universities, Marketing emails, Discounts, newsletters, Brand communication| *Legit, but not urgent*|
| Social     | `CATEGORY_SOCIAL`     |LinkedIn, Facebook, Instagram, Twitter notifications|*Engagement-driven*|
| Updates    | `CATEGORY_UPDATES`    |Bills, Orders, Shipping notifications,Subscriptions| *Transactional*|


However, firstly

> **Gmail labels are not the Ground Truth.**

Because I see a lot of promotional emails under Primary Cateory and Updates Category. And newsletters under Promotions and a lot of brand promotions under Updates.

So, clearly it's not very reliable. Hence, I am going to **use Gmail labels as only hints** for my sampling process.

And secondly,

On exploring mail stats and looking at my mails, I think:
- My mailbox is primarily dominated by social media updates which I never look at. So, I'll put these in the "Spam" bucket because I don't want them and they ave filled my inbox.
- Primary Section has a lot of mails that I would want to read like newsletters, important updates, etc. But it has a lot of promotions too. I'll put these under the "Important" bucket for now and we can filter out thhe important ones in the manual labelling part.
- Some of the mails in the Updates section are important to me and I would want to actually read them like newsletters, delivery updates, transactions, etc. So, I'll put these under the "Important" bucket and again, filter them out in the manual labelling part.
- Promotions are present everywere in my mailbox except the Social Category. Promotions tab has some emails that I would want to read but I believe Primary and Updates are better sections to pick important emails from. Hence, I'll put these under the "Promo" Bucket.


### Target Sample Size 

Based on this understanding of Gmail labels, my mailbox constituents and personal relevance of the mails, let's pick a sample of size around 1200 with the following composition:


| Bucket   (Potential)                | Count | Why                         |Gmail Category|
| ------------------------ | ----- | --------------------------- |-------------------------|
| Important| 500   | Not very large in number, important and hence, High precision needed       |Primary / Updates|
| Promotions | 300   | Second most common class and unimportant       |Promotions|
| Spam   | 400   | Most Common class and unimportant |Social|

These buckets are the potential buckets based on the Gmail labels we have. After taking the sample, I'll manually label the mails under these 3 buckets based on my personal relevance, which will be the final labels used for training the ML model.

## Creating Helper Columns for Sampling

In [3]:
# Potential Personal Relevance Flags based on Gmail category flags
df["is_important"] = (
    df["labels"].str.contains("CATEGORY_PRIMARY", regex=False, na=False) |
    df["labels"].str.contains("CATEGORY_UPDATES", regex=False, na=False)
)

df["is_promo"] = df["labels"].str.contains("CATEGORY_PROMOTIONS", regex=False, na=False)
df["is_spam"] = df["labels"].str.contains("CATEGORY_SOCIAL", regex=False, na=False)


# Quick sanity check
df[["is_important", "is_promo", "is_spam"]].mean()


is_important    0.321306
is_promo        0.174015
is_spam         0.484355
dtype: float64

In [None]:
## Important Bucket as almost 32% of the mails, probably because we combined both Primary and Updates section into one bucket.
## Rest of the stats look aligned with my understanding of the mailbox.

### Sample Important emails

In [4]:
primary_df = df[df["is_important"]]

sample_primary = (
    primary_df
    .groupby("sender_domain", group_keys=False)
    .apply(lambda x: x.sample(min(len(x), 3), random_state=42))
)

sample_primary = sample_primary.sample(
    n=min(500, len(sample_primary)),
    random_state=42
)

print("Primary sample:", len(sample_primary))


Primary sample: 500


  .apply(lambda x: x.sample(min(len(x), 3), random_state=42))


### Sample Promotional emails

In [5]:
promo_df = df[df["is_promo"]]

sample_promo = (
    promo_df
    .groupby("sender_domain", group_keys=False)
    .apply(lambda x: x.sample(min(len(x), 2), random_state=42))
)

sample_promo = sample_promo.sample(
    n=min(300, len(sample_promo)),
    random_state=42
)

print("Promo sample:", len(sample_promo))


Promo sample: 300


  .apply(lambda x: x.sample(min(len(x), 2), random_state=42))


### Sample Spam Emails

In [6]:
spam_df = df[df["is_spam"]]

sample_other = (
    spam_df
    .groupby("sender_domain", group_keys=False)
    .apply(lambda x: x.sample(min(len(x), 2), random_state=42))
)

sample_other = sample_other.sample(
    n=min(400, len(sample_other)),
    random_state=42
)

print("Other sample:", len(sample_other))


Other sample: 21


  .apply(lambda x: x.sample(min(len(x), 2), random_state=42))


In [18]:
spam_df["sender_domain"].nunique()

11

In [None]:
## There are only 11 sender domains in the Social Category (probably because we have limited popular social media apps I have my account on).
## Hence, we should not limit the mails at 2 per sender domain to obtain the desired sample size.

In [7]:
sample_other = (
    spam_df
    .groupby("sender_domain", group_keys=False)
    .apply(lambda x: x.sample(
        n=max(1, int(400 * len(x) / len(spam_df))),
        random_state=42
    ))
)

print("Spam sample:", len(sample_other))

Spam sample: 399


  .apply(lambda x: x.sample(


### Combine all samples

In [8]:
sampled_df = pd.concat(
    [sample_primary, sample_promo, sample_other],
    ignore_index=True
)

print("Total sampled:", len(sampled_df))


Total sampled: 1199


## Prepare CSV for manual labeling

In [11]:
#Choosing a subset of required columns
label_df = sampled_df[
    [
        "message_id",
        "subject",
        "snippet",
        "sender",
        "sender_domain",
        "internal_date",
        "is_important",
        "is_promo",
        "is_spam"
    ]
].copy()


#Converting internal date from unix timestamp in milliseconds into date
label_df["internal_date"] = (
    pd.to_datetime(label_df["internal_date"], unit="ms", utc=True)
      .dt.tz_convert("Asia/Kolkata")
      .dt.date
)

#Creating plcaeholder column for manual labels
label_df["label"] = ""   # will fill this manually

#Converting into CSV
label_df.to_csv("../data/emails_to_label.csv", index=False)

#Looking at top 5 rows
label_df.head()


Unnamed: 0,message_id,subject,snippet,sender,sender_domain,internal_date,is_important,is_promo,is_spam,label
0,17e99c136d770537,Your mobile recharge for Rs. 15.00 is success...,"Amazon.in Recharges Dear customer, Your rechar...",Amazon Pay <noreply@amazonpay.in>,amazonpay.in,2022-01-27,True,False,False,
1,18c39485254e45a4,Your order is on its way,Download app The one you&#39;ve been waiting f...,RENTOMOJO <noreply@rentomojo.com>,rentomojo.com,2023-12-05,True,False,False,
2,182c941da6fa286d,Discover Fresh Arrivals for Our Brand New Cate...,Shop for ₹599 | Get Upto 40% Off | Code: MAMA4...,Mamaearth <support@info.mamaearth.in>,info.mamaearth.in,2022-08-23,True,False,False,
3,190b1e939cc8f72b,Product registration confirmation,PRODUCTS SUPPORT PRODUCT REGISTRATION Dear Pri...,Sony India Product Registration System <no-rep...,alerts.sony.co.in,2024-07-14,True,False,False,
4,17b4f317cb3f7b10,Online Live Project / Work Experience Program ...,"Dear Sri Venkateswara College Students, Here i...",Finlatics Hub <finlatics@fincruxtech.com>,fincruxtech.com,2021-08-16,True,False,False,


In [13]:
label_df["internal_date"].dtype

dtype('O')

In [15]:
## Checking if the samples are evenly spread over time or not
label_df["internal_date"] = pd.to_datetime(label_df["internal_date"])


flags = ["is_important", "is_promo", "is_spam"]

summary = {
    flag: label_df.loc[label_df[flag], "internal_date"].dt.year.value_counts()
    for flag in flags
}

summary_df = pd.DataFrame(summary).fillna(0).astype(int).sort_index()
summary_df


Unnamed: 0_level_0,is_important,is_promo,is_spam
internal_date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017,4,3,2
2018,8,7,5
2019,4,7,2
2020,79,38,108
2021,91,59,249
2022,74,48,10
2023,82,55,4
2024,60,26,15
2025,98,57,4


In [None]:
## Looks like the sampling has chosen emails from all the past 9 years. 2017, 2018 and 2019 have lower proportion of emails probably because I started using internet a lot after 2019.

## Criteria for Manual Labelling

| Label         | Meaning             |
| ------------- | ------------------- |
| `important`   | I’d open & act    |
| `promotional` | Legit but ignorable |
| `spam`        | I don’t want this |
