# Employee Sentiment Analysis
This notebook analyzes employee emails to assess sentiment, identify top-performing and at-risk employees, and predict engagement trends using NLP and machine learning.

## 1. Sentiment Labeling
Used VADER from NLTK to classify each message as Positive, Negative, or Neutral based on compound scores:
- Compound ≥ 0.05 → Positive
- Compound ≤ -0.05 → Negative
- Otherwise → Neutral

Labeled sentiments are saved to `outputs/test_labeled.csv` (written in the final cell).

In [9]:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Load data
df = pd.read_csv("test(in).csv")

# Initialize VADER
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    if pd.isna(text): return "Neutral"
    score = sia.polarity_scores(text)['compound']
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    else:
        return "Neutral"

# Apply sentiment
df['sentiment_label'] = df['body'].apply(get_sentiment)

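To make the boundary behavior of the cutoffs easy to check without loading the VADER lexicon, here is a minimal sketch of the same rule applied to a raw compound score. The function name `label_from_compound`, its `threshold` parameter, and the sample scores are illustrative assumptions, not part of the notebook.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a label using inclusive cutoffs."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical compound scores chosen to probe both boundaries
for s in (0.6, 0.05, 0.049, 0.0, -0.05, -0.8):
    print(s, label_from_compound(s))
```

A score of exactly 0.05 or -0.05 falls on the Positive or Negative side respectively, matching the `>=` and `<=` comparisons in `get_sentiment`.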


## 2. Exploratory Data Analysis (EDA)
Explored the dataset for trends and distributions:
- **Sentiment Distribution** shows most emails are positive.
- **Monthly Trends** show fluctuation in engagement.
- **Top 10 Active Employees** are mostly Enron-related.
- **Word Count** distribution peaks below 100 words per message.

Visualizations are saved in the `visualization/` folder.

In [10]:
import os
import matplotlib.pyplot as plt

os.makedirs("outputs", exist_ok=True)
os.makedirs("visualization", exist_ok=True)

# Sentiment trend by month
# Ensure 'date' column is in datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Now it's safe to extract the month period
df['month'] = df['date'].dt.to_period('M')

sentiment_monthly = df.groupby(['month', 'sentiment_label']).size().unstack().fillna(0)

sentiment_monthly.plot(kind='line', figsize=(10,5), title="Monthly Sentiment Trend")
plt.ylabel("Message Count")
plt.xlabel("Month")
plt.tight_layout()
plt.savefig("visualization/monthly_sentiment_trend.png")
plt.close()

# Message count per employee (top 10)
top_employees = df['from'].value_counts().head(10)
top_employees.plot(kind='barh', title='Top 10 Most Active Employees', color='skyblue')
plt.xlabel("Message Count")
plt.tight_layout()
plt.savefig("visualization/top_10_employees.png")
plt.close()

# Word count distribution
df['word_count'] = df['body'].fillna("").apply(lambda x: len(x.split()))
plt.hist(df['word_count'], bins=50, color='purple')
plt.title("Distribution of Word Counts")
plt.xlabel("Word Count")
plt.ylabel("Frequency")
plt.tight_layout()
plt.savefig("visualization/word_count_distribution.png")
plt.close()
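The cell above plots the monthly trend, the most active senders, and word counts, but it does not tabulate the overall label distribution. A minimal sketch of that tabulation follows, assuming the labels are available as a plain list; `label_shares` and `sample_labels` are hypothetical names and the sample values are made up for illustration.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of messages carrying each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels, not taken from the dataset
sample_labels = ["Positive", "Positive", "Neutral", "Negative"]
shares = label_shares(sample_labels)
print(shares)
```

In the notebook itself, something like `label_shares(df['sentiment_label'].tolist())` would give the shares behind the "mostly positive" observation.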


## 3. Monthly Sentiment Score
Each message is scored as:
- +1 for Positive
- -1 for Negative
- 0 for Neutral

Scores are summed per employee and month and saved to `outputs/monthly_scores.csv`.

In [11]:
# Numeric score for each sentiment label
score_map = {"Positive": 1, "Negative": -1, "Neutral": 0}

df['score'] = df['sentiment_label'].map(score_map)
df['month'] = df['date'].dt.to_period('M')  # same value as set in the EDA cell
df['employee'] = df['from']

# Sum the per-message scores for each employee in each month
monthly_scores = df.groupby(['employee', 'month'])['score'].sum().reset_index()
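The aggregation can also be traced by hand on a few rows. The sketch below is a dependency-free illustration of the same idea, assuming each record is an `(employee, date, label)` tuple; the function `monthly_totals`, the addresses, and the dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date

SCORE = {"Positive": 1, "Negative": -1, "Neutral": 0}


def monthly_totals(records):
    """Sum per-message scores keyed by (employee, 'YYYY-MM')."""
    totals = defaultdict(int)
    for employee, sent_on, label in records:
        totals[(employee, sent_on.strftime("%Y-%m"))] += SCORE[label]
    return dict(totals)


# Made-up records for illustration only
records = [
    ("a@example.com", date(2010, 1, 5), "Positive"),
    ("a@example.com", date(2010, 1, 20), "Negative"),
    ("a@example.com", date(2010, 2, 1), "Positive"),
    ("b@example.com", date(2010, 1, 9), "Neutral"),
]
totals = monthly_totals(records)
print(totals)
```

Note that a monthly score of 0 can mean either all-neutral messages or positives and negatives that cancel out, as for `a@example.com` in January here.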

## 4. Employee Ranking
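Ranked employees within each month by monthly sentiment score.
- `top_3_positive.csv` has the three highest-scoring employees per month.
- `top_3_negative.csv` has the three lowest-scoring employees per month.

The positive ranking sorts by score descending and the negative ranking by score ascending; ties are broken alphabetically by employee.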

In [12]:
# Top 3 Positive
top_pos = monthly_scores.sort_values(['month','score','employee'], ascending=[True, False, True])
top_3_pos = top_pos.groupby('month').head(3)

# Top 3 Negative
top_neg = monthly_scores.sort_values(['month','score','employee'], ascending=[True, True, True])
top_3_neg = top_neg.groupby('month').head(3)
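To show how the tie-break behaves, here is a small stand-alone sketch of a per-month top-N ranking with alphabetical tie-breaking. It assumes rows of `(employee, month, score)`; `top_n_per_month`, its parameters, and the sample rows are hypothetical and only mirror the sort order used above.

```python
from collections import defaultdict


def top_n_per_month(rows, n=3, highest=True):
    """Return {month: [(employee, score), ...]} with ties broken by name."""
    by_month = defaultdict(list)
    for employee, month, score in rows:
        by_month[month].append((employee, score))

    sign = -1 if highest else 1  # negate scores to sort descending
    result = {}
    for month, entries in by_month.items():
        ranked = sorted(entries, key=lambda e: (sign * e[1], e[0]))
        result[month] = ranked[:n]
    return result


# Made-up monthly scores; alice and bob tie at 5
rows = [
    ("dana", "2010-01", 2),
    ("bob", "2010-01", 5),
    ("alice", "2010-01", 5),
    ("carl", "2010-01", -1),
]
top = top_n_per_month(rows)
bottom = top_n_per_month(rows, highest=False)
print(top)
print(bottom)
```

When fewer than four employees are active in a month, the same employee can appear in both the top and bottom lists, which is also true of the `groupby('month').head(3)` approach.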


## 5. Flight Risk Identification
Employees are flagged if they sent **≥ 4 negative messages** in any rolling 30-day window.
- The window rolls continuously across dates, so it is not reset at calendar-month boundaries.
- The flagged list is saved to `outputs/flight_risks.csv`.

In [13]:
# Get only negative messages
df_neg = df[df['sentiment_label'] == 'Negative'][['employee', 'date']].sort_values(['employee','date'])

# Rolling 30-day window count
flight_risk = []

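for emp, group in df_neg.groupby('employee'):
    # Drop unparseable dates (NaT) so the day arithmetic below is well defined
    dates = group['date'].dropna().tolist()
    for i in range(len(dates)):
        # Dates are sorted, so a window ending at dates[i] with 4+ messages always
        # includes the 3 immediately preceding ones; a look-back of 10 is ample
        window = dates[max(0, i - 10):i + 1]
        count = sum(0 <= (dates[i] - d).days <= 30 for d in window)
        if count >= 4:
            flight_risk.append(emp)
            break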

flight_risk = list(set(flight_risk))
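An alternative way to express the same check is a two-pointer sliding window over the sorted dates, which avoids the fixed look-back. This is a sketch under the same 30-day, four-message rule, not the notebook's implementation; `has_burst`, its parameters, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def has_burst(dates, min_count=4, window_days=30):
    """True if at least min_count dates fall within some window_days span."""
    ordered = sorted(dates)
    start = 0
    for end, current in enumerate(ordered):
        # Advance the left edge until the window spans at most window_days
        while (current - ordered[start]).days > window_days:
            start += 1
        if end - start + 1 >= min_count:
            return True
    return False


# Made-up timestamps for illustration
d0 = datetime(2010, 3, 1)
clustered = [d0 + timedelta(days=n) for n in (0, 5, 12, 30)]
spread = [d0 + timedelta(days=n) for n in (0, 20, 40, 60)]
print(has_burst(clustered), has_burst(spread))
```

The first list has four messages within 30 days and is flagged; the second never has more than two inside any 30-day span and is not.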

## 6. Predictive Modeling
Built a **Linear Regression** model to predict each employee's monthly sentiment score:
- Features used: number of messages, average message length (in characters).
- Evaluated using MSE and R² on a held-out 20% test split.

On the test split the model reached R² ≈ 0.72 and MSE ≈ 3.35.

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Feature Engineering
# Ensure message length column exists
df['msg_length'] = df['body'].str.len()
features = df.groupby(['employee','month']).agg({
    'score': 'sum',
    'msg_length': 'mean',
    'body': 'count'
}).reset_index()

features.rename(columns={'body': 'msg_count', 'msg_length': 'avg_length'}, inplace=True)

X = features[['msg_count', 'avg_length']]
y = features['score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Linear Regression Evaluation")
print("MSE:", mse)
print("R^2:", r2)

# Save labeled data
df.to_csv("outputs/test_labeled.csv", index=False)
# Save scores and rankings
monthly_scores.to_csv("outputs/monthly_scores.csv", index=False)
top_3_pos.to_csv("outputs/top_3_positive.csv", index=False)
top_3_neg.to_csv("outputs/top_3_negative.csv", index=False)
pd.DataFrame(flight_risk, columns=['employee']).to_csv("outputs/flight_risks.csv", index=False)

Linear Regression Evaluation
MSE: 3.3496574465159115
R^2: 0.7183250937343457
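For readers less familiar with the two metrics, the sketch below computes them from their standard definitions in plain Python: MSE is the mean squared residual, and R² is one minus the ratio of residual to total sum of squares. The helper names `mse_manual` and `r2_manual` and the sample values are hypothetical.

```python
def mse_manual(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up monthly scores and predictions
y_true_demo = [3, -1, 2, 0]
y_pred_demo = [2, 0, 2, 1]
print(mse_manual(y_true_demo, y_pred_demo))
print(r2_manual(y_true_demo, y_pred_demo))
```

An R² near 0 would mean the model does no better than always predicting the mean monthly score, which is a useful baseline when reading the value reported above.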


## 7. Conclusion
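- Most messages were labeled positive, which suggests generally good engagement.
- The rolling 30-day count of negative messages flags potential flight risks without waiting for month-end totals.
- The linear model reaches R² ≈ 0.72 with only two features and can be enhanced with richer ones.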