## Email Sentiment Analysis – What I Did (Quick Notes)

1. **Goal**
   Built a system to classify client emails into **Positive / Negative / Neutral** using ML.

2. **Data**
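   Used the Enron email dataset (517,401 messages), randomly sampling 50,000 and keeping 49,012 after cleaning.
   Labels come from VADER compound scores (≥ 0.05 Positive, ≤ -0.05 Negative, otherwise Neutral), so the models learn to reproduce VADER rather than human judgments.
   Observed **heavy class imbalance**: about 82% Positive, 9% Negative, 9% Neutral.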

3. **Preprocessing**
   Light cleaning with `clean_text`: lowercasing, URL removal, stripping non-letter characters, and whitespace normalization.
   Prediction inputs should go through the same function so their features match the training data.

4. **Feature Engineering**
   Converted text to numbers using **TF-IDF (unigrams + bigrams)**.
   This captures both important words and short phrases.

5. **Model Choice**
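   Tested Logistic Regression and a linear **SVM** (`LinearSVC`).
   SVM scored slightly higher on test accuracy (90.7% vs 90.3%) and negative precision (0.71 vs 0.64); Logistic Regression had the higher negative recall (0.66 vs 0.62).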

6. **Imbalance Handling**
   Applied manual **class weights** (Negative 3.0, Neutral 2.0, Positive 1.0) so that misclassifying a negative email costs the most.
   Focused on improving **negative recall**, not just accuracy.

7. **Training Results**
   Training accuracy ≈ **95.6%** (SVM)
   Testing accuracy ≈ **90.7%** (SVM)
   Gap of about 5 points → mild overfitting, acceptable generalization.

8. **Prediction Pipeline**
   Built Excel → Model → Excel flow for real-world usability.

9. **Key Learning**
   Accuracy alone is misleading in imbalanced datasets;
   **Recall and F1-score** matter more for critical classes (negative emails).

10. **Outcome**
    End-to-end NLP + ML pipeline ready for practical use and demonstration.



In [96]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [97]:
import pandas as pd

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/emails.csv")

# Check size
print(df.shape)

(517401, 2)


In [98]:
df = df[['message']]

In [99]:
df_sample = df.sample(n=50000, random_state=42)

In [100]:
df_sample = df_sample.reset_index(drop=True)

In [101]:
print(df_sample.shape)
df_sample.head()

(50000, 1)


Unnamed: 0,message
0,Message-ID: <21013688.1075844564560.JavaMail.e...
1,Message-ID: <22688499.1075854130303.JavaMail.e...
2,Message-ID: <27817771.1075841359502.JavaMail.e...
3,Message-ID: <10695160.1075858510449.JavaMail.e...
4,Message-ID: <27819143.1075853689038.JavaMail.e...


In [102]:
df_sample.to_csv("enron_sample_50000.csv", index=False)  # file name matches the 50,000-row sample

In [103]:
df_sample['original_email'] = df_sample['message']

In [104]:
# Extract the email body: headers end at the first blank line
def extract_body(email):
    if isinstance(email, str):
        parts = email.split("\n\n", 1)
        if len(parts) > 1:
            return parts[1]
    return email
df_sample['email_body'] = df_sample['original_email'].apply(extract_body)
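
Splitting on the first blank line works for simple messages. As an alternative, the standard-library `email` parser can separate headers from the body. The sketch below is a minimal example under that assumption; the function name `extract_body_with_parser` and the sample message are hypothetical and not part of the original notebook.

```python
from email import message_from_string


def extract_body_with_parser(raw_email):
    """Return the plain-text body of an RFC 822-style message string."""
    if not isinstance(raw_email, str):
        return ""  # non-string input (e.g. NaN) yields an empty body
    msg = message_from_string(raw_email)
    if msg.is_multipart():
        # Keep only the text/plain parts of a multipart message
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts)
    return msg.get_payload()


# Hypothetical sample message: two headers, a blank line, then the body
sample = (
    "Message-ID: <123@example>\n"
    "Subject: Meter check\n"
    "\n"
    "Please check the meter.\n\nThanks"
)
body = extract_body_with_parser(sample)
print(body)
```

Returning an empty string for non-string input also means the later length filter would drop such rows instead of the cleaning step failing on them.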

In [105]:
#cleaning text
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"\n", " ", text)            # remove newlines
    text = re.sub(r"[^a-z\s]", "", text)       # remove symbols
    text = re.sub(r"\s+", " ", text).strip()
    return text
df_sample['clean_email'] = df_sample['email_body'].apply(clean_text)
df_sample.head()

Unnamed: 0,message,original_email,email_body,clean_email
0,Message-ID: <21013688.1075844564560.JavaMail.e...,Message-ID: <21013688.1075844564560.JavaMail.e...,Bill: Thanks for the info. I also spoke wit...,bill thanks for the info i also spoke with jef...
1,Message-ID: <22688499.1075854130303.JavaMail.e...,Message-ID: <22688499.1075854130303.JavaMail.e...,"Aimee,\nPlease check meter #1591 Lamay gas lif...",aimee please check meter lamay gas lift it doe...
2,Message-ID: <27817771.1075841359502.JavaMail.e...,Message-ID: <27817771.1075841359502.JavaMail.e...,GCCA Crawfish and rip-off raffle & over-priced...,gcca crawfish and ripoff raffle overpriced pri...
3,Message-ID: <10695160.1075858510449.JavaMail.e...,Message-ID: <10695160.1075858510449.JavaMail.e...,"<<Keoni.zip>> Chris, per your request here ar...",keonizip chris per your request here are the a...
4,Message-ID: <27819143.1075853689038.JavaMail.e...,Message-ID: <27819143.1075853689038.JavaMail.e...,I'm trying to change the Receipt Meter on deal...,im trying to change the receipt meter on deal ...


In [106]:
# Remove empty and very short emails (20 characters or fewer after cleaning)
df_sample = df_sample[df_sample['clean_email'].str.len() > 20]
df_sample = df_sample.reset_index(drop=True)
df_sample.shape

(49012, 4)

In [107]:
#final columns
df_sample = df_sample[['original_email', 'clean_email']]

In [108]:
#saving base file
df_sample.to_csv("enron_clean_for_sentiment.csv", index=False)

In [109]:
#installing and importing vader
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [110]:
#initializing vader
sia = SentimentIntensityAnalyzer()

In [111]:
#get sentiment score
def get_sentiment_score(text):
    return sia.polarity_scores(text)['compound']
df_sample['sentiment_score'] = df_sample['clean_email'].apply(get_sentiment_score)

In [112]:
#converting score to sentiment label
def get_sentiment_label(score):
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    else:
        return "Neutral"
df_sample['sentiment'] = df_sample['sentiment_score'].apply(get_sentiment_label)
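
The same thresholding can be applied to a whole column at once. This is a small sketch using `numpy.select`; the helper name `label_scores` and the example scores are illustrative only.

```python
import numpy as np


def label_scores(scores, threshold=0.05):
    """Map compound scores to labels using symmetric thresholds."""
    scores = np.asarray(scores, dtype=float)
    conditions = [scores >= threshold, scores <= -threshold]
    choices = ["Positive", "Negative"]
    # Anything strictly between -threshold and +threshold is Neutral
    return np.select(conditions, choices, default="Neutral")


labels = label_scores([0.9881, -0.1531, 0.0, 0.05, -0.05])
print(labels.tolist())
```

The boundary values are inclusive on both sides, matching the `>=` and `<=` comparisons in `get_sentiment_label`.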

In [113]:
df_sample[['clean_email', 'sentiment_score', 'sentiment']].head(10)

Unnamed: 0,clean_email,sentiment_score,sentiment
0,bill thanks for the info i also spoke with jef...,0.9881,Positive
1,aimee please check meter lamay gas lift it doe...,-0.1531,Negative
2,gcca crawfish and ripoff raffle overpriced pri...,0.9915,Positive
3,keonizip chris per your request here are the a...,0.5106,Positive
4,im trying to change the receipt meter on deal ...,0.0,Neutral
5,what if we replace section with something like...,0.9962,Positive
6,forwarded by phillip m lovehouect on pm jim li...,0.3182,Positive
7,dear mark as per our discussion at the law con...,0.8442,Positive
8,got your message last night what is up bet you...,0.9966,Positive
9,hello darron just wanted to let you know that ...,0.9886,Positive


In [114]:
#saving result
df_sample.to_csv("client_email_sentiment_results.csv", index=False)

In [115]:
# Separating features and labels
X = df_sample['clean_email']
y = df_sample['sentiment']

In [116]:
# Label encoding (LabelEncoder sorts classes alphabetically: Negative=0, Neutral=1, Positive=2)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

In [117]:
#train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42
)
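
With classes this imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The sketch below uses toy labels with an 80/10/10 mix, not the real data. Applying it to the notebook's split would likely shift the reported metrics slightly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mimicking an 80/10/10 imbalance (illustrative, not the real data)
y_toy = np.array([2] * 80 + [0] * 10 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in the test split follow the overall proportions
print(np.bincount(y_te))
```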

In [118]:
#tf-idf vectorization for feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),      # unigrams + bigrams
    max_df=0.9,              # ignore very common words
    min_df=5,                # ignore rare noise words
    sublinear_tf=True,       # log scaling
    stop_words='english'
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [119]:
#model training using logistic regression
from sklearn.linear_model import LogisticRegression

class_weights = {
    0: 3.0,   # Negative (highest importance)
    1: 2.0,   # Neutral
    2: 1.0    # Positive (lowest)
}

logreg = LogisticRegression(
    C=2.0,
    class_weight=class_weights,
    max_iter=2000
)

model = logreg.fit(X_train_tfidf, y_train)
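
For comparison with the manual 3 : 2 : 1 weights, scikit-learn's `class_weight="balanced"` option uses `n_samples / (n_classes * count)` per class. The sketch below evaluates that formula by hand from the class proportions reported by `value_counts(normalize=True)` in this notebook. It only illustrates the arithmetic and was not part of the original run.

```python
# Class proportions from df_sample['sentiment'].value_counts(normalize=True)
proportions = {"Positive": 0.823431, "Negative": 0.090794, "Neutral": 0.085775}
n_classes = len(proportions)

# "balanced" weight = n_samples / (n_classes * count) = 1 / (n_classes * proportion)
balanced = {label: 1.0 / (n_classes * p) for label, p in proportions.items()}

for label, weight in balanced.items():
    print(f"{label}: {weight:.2f}")

# How much more a Negative error costs than a Positive one under "balanced"
neg_to_pos_ratio = balanced["Negative"] / balanced["Positive"]
print(f"Negative/Positive weight ratio: {neg_to_pos_ratio:.2f}")
```

Under this heuristic the Negative class would be weighted roughly nine times the Positive class, so the manual 3 : 1 ratio is a considerably milder correction.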

In [120]:
#model evaluation
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 0.9030908905437111
              precision    recall  f1-score   support

    Negative       0.64      0.66      0.65       910
     Neutral       0.74      0.74      0.74       847
    Positive       0.95      0.95      0.95      8046

    accuracy                           0.90      9803
   macro avg       0.78      0.78      0.78      9803
weighted avg       0.90      0.90      0.90      9803



In [121]:
#training linear svm
from sklearn.svm import LinearSVC

class_weights = {
    0: 3.0,   # Negative (highest importance)
    1: 2.0,   # Neutral
    2: 1.0    # Positive (lowest)
}

svm = LinearSVC(
    C=0.8,
    class_weight=class_weights
)

svm_model = svm.fit(X_train_tfidf, y_train)

In [122]:
#evaluating svm
from sklearn.metrics import accuracy_score, classification_report

y_pred_svm = svm_model.predict(X_test_tfidf)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(
    y_test, y_pred_svm, target_names=label_encoder.classes_
))

SVM Accuracy: 0.907069264510864
              precision    recall  f1-score   support

    Negative       0.71      0.62      0.66       910
     Neutral       0.74      0.71      0.72       847
    Positive       0.94      0.96      0.95      8046

    accuracy                           0.91      9803
   macro avg       0.80      0.76      0.78      9803
weighted avg       0.90      0.91      0.90      9803
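
To put the accuracy figures in context, the arithmetic below uses only the support counts and the precision/recall values printed in the two reports above. It computes the accuracy of always predicting the majority class and recomputes the Negative-class F1 for both models.

```python
# Test-set support per class, from the classification reports
support = {"Negative": 910, "Neutral": 847, "Positive": 8046}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(support.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


neg_f1_logreg = f1(0.64, 0.66)  # Logistic Regression, Negative class
neg_f1_svm = f1(0.71, 0.62)     # LinearSVC, Negative class
print(f"Negative F1, Logistic Regression: {neg_f1_logreg:.2f}")
print(f"Negative F1, SVM: {neg_f1_svm:.2f}")
```

Both models beat the baseline by about 8 to 9 points of accuracy, while their Negative-class F1 scores are nearly identical. The choice between them comes down to whether precision or recall on negative emails matters more.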



In [123]:
import os

# 1. Transform all emails using trained TF-IDF
X_all_tfidf = tfidf.transform(df_sample['clean_email'])

# 2. Predict sentiment using trained SVM
df_sample['svm_sentiment_encoded'] = svm_model.predict(X_all_tfidf)

# 3. Convert encoded labels back to text labels
df_sample['svm_sentiment'] = label_encoder.inverse_transform(
    df_sample['svm_sentiment_encoded']
)

# 4. (Optional) Drop helper column
df_final = df_sample.drop(columns=['svm_sentiment_encoded'], errors='ignore')

# 5. Save results to Excel in Google Drive
output_path = "/content/drive/MyDrive/svm_email_sentiment_results.xlsx"
df_final.to_excel(output_path, index=False)

# 6. Confirm file saved
print("File saved:", os.path.exists(output_path))
print("Saved at:", output_path)


KeyboardInterrupt: 

In [124]:
#Manual prediction
def predict_email_sentiment(email_text):
    # clean text with the same function used for the training data
    email_text = clean_text(email_text)

    # vectorize
    email_tfidf = tfidf.transform([email_text])

    # predict
    pred_encoded = svm_model.predict(email_tfidf)[0]

    # decode label
    pred_label = label_encoder.inverse_transform([pred_encoded])[0]

    return pred_label


In [125]:
test_email = """
Hi team,

I am very unhappy with the delay in response.
This has impacted our project timeline badly.
Please resolve this urgently.

Regards
"""

print("Predicted Sentiment:", predict_email_sentiment(test_email))


Predicted Sentiment: Negative


In [126]:
print("Train accuracy:", svm_model.score(X_train_tfidf, y_train))
print("Test accuracy:", svm_model.score(X_test_tfidf, y_test))

Train accuracy: 0.9561580249432529
Test accuracy: 0.907069264510864


In [131]:
new_emails = pd.DataFrame({
    "email_text": [
        "This is highly unacceptable. This needs to be fixed immediately.",
        "Thank you for the quick resolution. Really appreciate it.",
        "Schedule a meeting for next week."
    ]
})

# Note: only lowercasing is applied here; the training text also went through clean_text()
new_emails['clean_email'] = new_emails['email_text'].str.lower()

X_new = tfidf.transform(new_emails['clean_email'])
new_emails['predicted_sentiment'] = label_encoder.inverse_transform(
    svm_model.predict(X_new)
)

new_emails

Unnamed: 0,email_text,clean_email,predicted_sentiment
0,This is highly unacceptable. This needs to be ...,this is highly unacceptable. this needs to be ...,Neutral
1,Thank you for the quick resolution. Really app...,thank you for the quick resolution. really app...,Positive
2,Schedule a meeting for next week.,schedule a meeting for next week.,Neutral
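
The first email reads as clearly negative but is predicted Neutral. One possible reason (an assumption, not verified here) is that few of its words survive into the 5,000-feature TF-IDF vocabulary. The sketch below shows how to check this on a toy vectorizer; `toy_corpus` and `known_terms` are hypothetical names. The same check could be run against the fitted `tfidf` object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the training emails (illustrative only)
toy_corpus = [
    "thanks for the quick update on the gas deal",
    "please check the meter reading for the gas deal",
    "thanks again for the meter update",
]
toy_tfidf = TfidfVectorizer(stop_words="english")
toy_tfidf.fit(toy_corpus)


def known_terms(vectorizer, text):
    """Return the tokens of `text` that exist in the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()
    return [tok for tok in analyzer(text) if tok in vectorizer.vocabulary_]


text = "This is highly unacceptable. This needs to be fixed immediately."
vec = toy_tfidf.transform([text])

# No overlap with the vocabulary means an all-zero feature vector
print(known_terms(toy_tfidf, text), vec.nnz)
```

When the feature vector is all zeros, a linear model's prediction is determined by its intercepts alone, regardless of what the email says.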


In [129]:
df_sample['sentiment'].value_counts(normalize=True)

sentiment    proportion
Positive     0.823431
Negative     0.090794
Neutral      0.085775


In [134]:
import pandas as pd

# 1. LOAD INPUT EXCEL FILE
input_path = "/content/drive/MyDrive/sample_input_emails.xlsx"   # change path if needed
df = pd.read_excel(input_path)

# 2. CLEANING (same clean_text function as training)
df['clean_email'] = df['email_text'].astype(str).apply(clean_text)

# 3. TRANSFORM USING TRAINED TF-IDF
X_new = tfidf.transform(df['clean_email'])

# 4. PREDICT SENTIMENT USING TRAINED SVM
predictions = svm_model.predict(X_new)

# 5. DECODE LABELS
df['predicted_sentiment'] = label_encoder.inverse_transform(predictions)

# 6. SAVE OUTPUT EXCEL FILE
output_path = "/content/drive/MyDrive/predicted_email_sentiments.xlsx"
df.to_excel(output_path, index=False)

print(f"✅ Prediction complete. File saved as: {output_path}")

✅ Prediction complete. File saved as: /content/drive/MyDrive/predicted_email_sentiments.xlsx
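
The Excel → Model → Excel flow above relies on `tfidf`, `svm_model`, and `label_encoder` still being in memory. One way to make it reusable across sessions is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and persist it with `joblib`. The following is a minimal sketch on toy data; the texts, labels, and file name are hypothetical, and string labels are used directly so no separate encoder is needed.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data (illustrative only)
toy_texts = [
    "thank you for the quick resolution",
    "really appreciate the great support",
    "this delay is unacceptable and very bad",
    "very unhappy with the terrible response",
    "schedule a meeting for next week",
    "the report is attached for review",
]
toy_labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# Vectorizer and classifier travel together, so prediction always
# uses the same features as training
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("svm", LinearSVC(C=0.8)),
])
pipeline.fit(toy_texts, toy_labels)

# Persist to disk and reload, as a later session would
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)

new_texts = ["thank you for the great support", "very unhappy with this delay"]
print(restored.predict(new_texts))
```

A later session would then only need `joblib.load` and the same cleaning function before calling `predict` on the Excel column.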
