## Dataset Overview & Safety Notes

This project uses anonymized organizational data that represents realistic employee behavior and psychological characteristics. Although no personally identifiable information is present, the data reflects real-world insider threat scenarios and must be interpreted carefully.

At this stage, the goal is not to detect risk or draw conclusions, but to understand the structure, meaning, and limitations of the available datasets. All analysis in this notebook focuses on data orientation only, avoiding assumptions about intent or malicious behavior.

## Email Dataset Overview

The email dataset captures communication-related activity generated by users within the organization. This data represents behavioral events rather than outcomes or intent.

**A. Row Meaning**

Each row in the email dataset represents a single email event associated with a user.
An event may involve sending or receiving an email and includes routing and message-level metadata.

**B. User Identifiers**

The following columns identify the user and system context:

- ID – A unique identifier for the email event
- User – An anonymized user identifier representing the employee
- PC – The workstation or device used to generate the email activity

**C. Time Granularity**

- Date-Time – The timestamp of each email event

The presence of precise timestamps enables analysis at multiple temporal levels, such as hourly, daily, or weekly patterns. This allows later exploration of how user behavior changes over time rather than relying on static snapshots.
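As a rough illustration of the temporal levels mentioned above, the sketch below counts email events per user at hourly, daily, and weekly granularity. The rows are made-up stand-ins for the real log, and the code assumes the `Date-Time` strings parse with pandas' default inference; the actual file may need an explicit `format=` argument.

```python
import pandas as pd

# Synthetic stand-in for the email log; all values are invented for illustration.
emails = pd.DataFrame({
    "User": ["U1", "U1", "U2", "U1"],
    "Date-Time": [
        "2010-01-04 09:15:00",
        "2010-01-04 09:45:00",
        "2010-01-04 14:05:00",
        "2010-01-12 10:00:00",
    ],
})

# Assumption: the timestamp strings are parseable without an explicit format.
emails["Date-Time"] = pd.to_datetime(emails["Date-Time"])
ts = emails["Date-Time"]

# Event counts per user at three granularities (only observed buckets appear).
hourly = emails.groupby(["User", ts.dt.floor("h")]).size()
daily = emails.groupby(["User", ts.dt.floor("D")]).size()
weekly = emails.groupby(["User", ts.dt.to_period("W")]).size()

print(daily)
```

These counts are descriptive only; buckets with no activity are simply absent rather than filled with zeros.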

**D. Email Routing Columns**

Email routing information describes how messages are communicated:

- From – Sender of the email
- To – Primary recipients
- CC – Carbon copy recipients
- BCC – Blind carbon copy recipients

These columns describe communication structure and network relationships rather than message content.
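One way to summarize communication structure without touching message content is to count recipients per event. The sketch below is illustrative only: it assumes multiple addresses in a cell are separated by semicolons and that empty CC/BCC cells are missing values, neither of which is confirmed here. The helper `count_recipients` and the `n_*` column names are hypothetical.

```python
import pandas as pd

# Synthetic rows; addresses and the ";" separator are assumptions for illustration.
emails = pd.DataFrame({
    "From": ["a@example.com", "b@example.com"],
    "To": ["b@example.com;c@example.com", "a@example.com"],
    "CC": [None, "c@example.com"],
    "BCC": [None, None],
})

def count_recipients(cell, sep=";"):
    """Return the number of addresses in one routing cell (0 if empty or missing)."""
    if pd.isna(cell) or not str(cell).strip():
        return 0
    return len([addr for addr in str(cell).split(sep) if addr.strip()])

# One count column per routing field, plus a total per email event.
for col in ["To", "CC", "BCC"]:
    emails[f"n_{col.lower()}"] = emails[col].map(count_recipients)

emails["n_recipients"] = emails[["n_to", "n_cc", "n_bcc"]].sum(axis=1)
print(emails[["n_to", "n_cc", "n_bcc", "n_recipients"]])
```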

**E. Message Attributes**

Message-level characteristics include:

- Size – Approximate size of the email
- Attachment – Indicator of whether the email includes an attachment
- Content – Textual or metadata representation of the email body

These attributes provide signals related to communication intensity and potential data movement, without implying malicious intent on their own.

**Key Interpretation Note**

Email data reflects behavioral patterns, such as frequency, timing, and communication structure. Individual email events should not be interpreted as risky in isolation.

## Psychometric Dataset Overview

The psychometric dataset captures user-level personality and behavioral traits. Unlike email logs, this data represents contextual characteristics rather than discrete actions.

**A. Row Meaning**

Each row in the psychometric dataset represents a single user profile or assessment snapshot.
The values describe stable or slowly changing personality traits rather than moment-to-moment behavior.

**B. User Identifiers**

- employee_name – An anonymized employee reference
- user_id – A unique user identifier

The user_id field aligns with the User column in the email dataset, enabling user-level integration across behavioral and psychometric data.
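Before relying on this alignment, it may be worth checking how well the two key columns actually overlap. The following is a minimal sketch on invented identifiers; the variable names are hypothetical and the real data may need normalization (case, whitespace) first.

```python
import pandas as pd

# Synthetic stand-ins for the two datasets; identifiers and scores are invented.
emails = pd.DataFrame({"User": ["U1", "U1", "U2", "U9"]})
psych = pd.DataFrame({"user_id": ["U1", "U2", "U3"], "O": [40, 35, 28]})

email_users = set(emails["User"].dropna().unique())
psych_users = set(psych["user_id"].dropna().unique())

matched = email_users & psych_users        # users present in both datasets
only_in_email = email_users - psych_users  # email activity without a profile
only_in_psych = psych_users - email_users  # profile without email activity

# A user-level table is expected to hold at most one row per user.
has_duplicates = bool(psych["user_id"].duplicated().any())

print(len(matched), sorted(only_in_email), sorted(only_in_psych), has_duplicates)
```

Unmatched identifiers on either side are a data-coverage observation, not a finding about any user.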

**C. Psychometric Trait Columns**

The dataset includes the following personality dimensions (commonly known as the Big Five):

- O (Openness) – Tendency toward creativity and openness to experience
- C (Conscientiousness) – Organization, reliability, and self-discipline
- E (Extraversion) – Sociability and assertiveness
- A (Agreeableness) – Cooperation and empathy
- N (Neuroticism) – Emotional instability and stress sensitivity

These traits provide contextual signals, not indicators of wrongdoing.

**Key Interpretation Note**

Psychometric traits describe propensities, not behavior. They should never be used alone to label or predict malicious activity and must be interpreted alongside technical and temporal data.

## Feature Grouping

To support later analysis, features can be grouped conceptually as follows.

**1. Email / Behavioral Features**

- Email frequency per user
- Time-of-day and day-of-week activity
- Attachment usage
- Communication patterns (To, CC, BCC usage)
- Message size trends

**2. Psychometric / Contextual Features**

- Personality trait scores (O, C, E, A, N)
- Stress or emotional sensitivity indicators (primarily N)

**3. Temporal Features**

- Changes in email behavior over time
- Pre- and post-event windows
- Short-term vs long-term activity trends
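To make the short-term versus long-term idea concrete, the sketch below compares two rolling averages of one user's daily email counts. The counts are invented, and the window lengths (2 and 5 days) are arbitrary choices for illustration, not values taken from this project.

```python
import pandas as pd

# Invented daily email counts for a single user.
daily = pd.Series(
    [2, 3, 2, 4, 3, 9, 10],
    index=pd.date_range("2010-01-04", periods=7, freq="D"),
    name="n_emails",
)

# Hypothetical window lengths; tune these to the data's actual rhythm.
short_term = daily.rolling(window=2, min_periods=1).mean()
long_term = daily.rolling(window=5, min_periods=1).mean()

# A ratio above 1 means recent activity exceeds the longer-run average.
ratio = short_term / long_term
print(ratio.round(2))
```

The ratio describes a change in volume only; it says nothing about why the change occurred.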

## User-Level vs Event-Level Data

Understanding the distinction between data levels is critical for correct analysis.

**1. Event-Level Data**

The email dataset is event-level, where each row represents a single user action. This data is high-frequency and suitable for detecting short-term behavioral changes and anomalies.

**2. User-Level Data**

The psychometric dataset is user-level, representing characteristics associated with a user rather than individual actions. These attributes change slowly, if at all.

**3. Why This Distinction Matters**

Event-level data must be aggregated over time to align meaningfully with user-level attributes. Treating event-level and user-level data as equivalent can lead to misleading conclusions.
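The aggregation step can be sketched as follows: collapse email events to one row per user, then join that summary to the psychometric table. All values are synthetic, the aggregate column names are hypothetical, and `Attachment` is assumed to be numeric (a 0/1 flag or a count).

```python
import pandas as pd

# Synthetic event-level rows (one row per email event).
emails = pd.DataFrame({
    "User": ["U1", "U1", "U2"],
    "Size": [1200, 800, 500],
    "Attachment": [1, 0, 0],  # assumption: numeric flag or count
})

# Synthetic user-level rows (one row per user); scores are invented.
psych = pd.DataFrame({
    "user_id": ["U1", "U2"],
    "O": [40, 35], "C": [30, 42], "E": [25, 38], "A": [33, 29], "N": [22, 31],
})

# Collapse events to one row per user before joining.
per_user = (
    emails.groupby("User")
    .agg(
        n_emails=("Size", "size"),
        mean_size=("Size", "mean"),
        attachment_rate=("Attachment", lambda s: (s > 0).mean()),
    )
    .reset_index()
)

# validate= raises if either side unexpectedly has repeated user keys.
combined = per_user.merge(
    psych, left_on="User", right_on="user_id", how="left", validate="one_to_one"
)
print(combined)
```

Merging the raw event table directly would instead repeat each user's trait scores on every email row, which inflates their apparent weight in any downstream summary.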

## Key Takeaways for Next Steps

Together, the two datasets combine high-volume, event-level email activity with slowly changing, user-level psychometric context. Proper analysis requires careful aggregation, temporal awareness, and ethical interpretation.

The next stage of the project will focus on exploratory analysis to understand baseline email behavior across users and how patterns vary over time, without assigning risk labels or drawing premature conclusions.