
# 📊 POS Tagging Analysis

## Project Overview

This notebook supports the study: **Detecting AI-Generated Job Scams: A Human
and Machine Perspective**.

We use **Part-of-Speech (POS) tagging** to explore whether AI-generated job
scams differ linguistically from real postings and from human-written fake
listings.

---

## Datasets Used

- ✅ Real Job Postings (17,014)
- ❌ Human-Written Fake Job Postings (866)
- 🤖 AI-Refined Fake Job Postings (866)

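Each entry includes the job `title`, `description`, and `requirements`. In the
AI-refined set, the latter two are stored as `refined_description` and
`refined_requirements`.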

---


In [None]:
import pandas as pd
import spacy
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

### 📦 Load & Combine Datasets

In [None]:
# Load datasets
real_jobs = pd.read_csv('1_datasets/real_jobs.csv')
fake_jobs = pd.read_csv('1_datasets/fake_jobs.csv')
llm_refined = pd.read_csv('1_datasets/llm_refined_fake_posts2.csv')


# Add source tags and unify text

def add_source_and_text(frame, source, columns):
    """Tag `frame` with its source and join `columns` into one text field."""
    frame['source'] = source
    frame['text'] = frame[columns].fillna('').agg(' '.join, axis=1)
    return frame

real_jobs = add_source_and_text(
    real_jobs, 'real', ['title', 'description', 'requirements']
)
fake_jobs = add_source_and_text(
    fake_jobs, 'human_fake', ['title', 'description', 'requirements']
)
llm_refined = add_source_and_text(
    llm_refined, 'ai_fake',
    ['title', 'refined_description', 'refined_requirements']
)

# Combine all
df = pd.concat([
    real_jobs[['text', 'source']],
    fake_jobs[['text', 'source']],
    llm_refined[['text', 'source']]
], ignore_index=True)

# Preview combined dataframe
df.head()

### 🧠 POS Tagging
We will now apply POS tagging using spaCy to analyze linguistic patterns.

In [None]:
nlp = spacy.load("en_core_web_sm")

def get_pos_tags(text):
    """Return the coarse-grained POS tag of every token in `text`."""
    doc = nlp(text)
    return [token.pos_ for token in doc]

# Sample 300 posts from each source for speed.
# groupby(...).sample draws per group directly and keeps the `source` column.
sampled_df = (
    df.groupby("source")
    .sample(n=300, random_state=42)
    .reset_index(drop=True)
)

sampled_df['pos_tags'] = sampled_df['text'].apply(get_pos_tags)
sampled_df.head(2)
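
Calling `nlp(text)` once per post works for 900 posts but scales poorly. A minimal sketch of batch tagging with spaCy's `nlp.pipe`, assuming `nlp` is the pipeline loaded above. The helper name `tag_texts` and the `batch_size` value are illustrative and not part of the original notebook:

```python
def tag_texts(nlp, texts, batch_size=64):
    """Return one list of coarse POS tags per input text.

    `nlp` is assumed to be a loaded spaCy pipeline; `nlp.pipe` streams
    the texts through it in batches instead of one call per text.
    """
    return [
        [token.pos_ for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]
```

In this notebook it could stand in for the `apply` call, for example `sampled_df['pos_tags'] = tag_texts(nlp, sampled_df['text'].tolist())`.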

### 📈 POS Tag Frequency Comparison

In [None]:
# Aggregate POS tag frequencies

def count_pos(pos_list):

    return Counter(pos_list)

sampled_df['pos_counts'] = sampled_df['pos_tags'].apply(count_pos)

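# Convert to long format: one row per (post, tag) pair
pos_long = []

for _, row in sampled_df.iterrows():
    for tag, count in row['pos_counts'].items():
        pos_long.append({'source': row['source'], 'pos': tag, 'count': count})

pos_df = pd.DataFrame(pos_long)

# Average count per post. Posts that lack a tag have no row for it, so divide
# the summed counts by the number of posts per source; a plain mean over the
# rows would skip those zero counts and inflate rare tags.
posts_per_source = sampled_df['source'].value_counts()
pos_df = pos_df.groupby(['source', 'pos'])['count'].sum().reset_index()
pos_df['count'] = pos_df['count'] / pos_df['source'].map(posts_per_source)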

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=pos_df, x='pos', y='count', hue='source')
plt.title("Average POS Tag Frequency by Source (Sampled)")
plt.ylabel("Average Count per Post")
plt.show()
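
Raw counts grow with post length, so a longer posting shows more of every tag. A minimal sketch of length-normalized frequencies, using only `Counter`. The helper name `pos_proportions` is hypothetical, and the tag lists are made-up inputs rather than results from the datasets:

```python
from collections import Counter


def pos_proportions(pos_tags):
    """Map each POS tag to its share of all tokens in one post."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        # Empty post: no tokens, so no proportions to report
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Illustrative input only: two toy tag sequences of different lengths
short_post = ["DET", "NOUN", "VERB", "NOUN"]
long_post = short_post * 5

print(pos_proportions(short_post))
print(pos_proportions(long_post))
```

Both calls yield the same proportions even though the second post is five times longer. Applying the helper to `sampled_df['pos_tags']` in place of `count_pos` would make the plotted values shares of tokens rather than counts.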


---

## 🧾 Non-Technical Explanation

We used a linguistic technique called **POS tagging** to see how job postings
use language differently.

- **Real listings** use more **proper nouns** and job-related **verbs**.
- **Human fakes** may be less polished, with irregular structure.
- **AI fakes** are more uniform and heavily polished, showing more
**determiners (DET)** and **adjectives (ADJ)**, which may point to persuasive
language.

These trends suggest that AI-generated scams are polished but predictable.

### Uncertainty

- POS tagging is structural: it can miss sarcasm, context, and domain-specific
usage.
- The datasets are imbalanced, and the LLM-refined samples may carry hidden
bias.

---


## ⚙️ Technical Summary

- We sampled 300 posts per class to ensure balance and reduce compute time.
- We used `spaCy` for POS tagging and `Counter` to count frequencies.
- Data was visualized using `seaborn` and `matplotlib`.
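
The chart compares means only, so it does not show whether a gap between two sources is larger than sampling noise. One possible check, sketched under assumptions, is a two-sided permutation test on a per-post value such as the share of `DET` tokens. The function name, the number of permutations, and the sample values are all illustrative; this test is not part of the original analysis:

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=42):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign posts to the two groups at random
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero
    return (extreme + 1) / (n_permutations + 1)


# Made-up per-post DET shares, for illustration only
real_det = [0.08, 0.09, 0.07, 0.10, 0.08, 0.09]
ai_det = [0.12, 0.13, 0.11, 0.12, 0.14, 0.13]

p = permutation_p_value(real_det, ai_det)
print(f"p-value: {p:.4f}")
```

A small p-value would indicate that random relabeling of posts rarely produces a gap as large as the observed one. With many tags compared at once, some correction for multiple comparisons would also be needed.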

### Limitations

- Fake postings are rare in the original dataset (866 of 17,880, about 5%),
which limits real-world generalizability.
- AI-refined texts were generated under our own assumptions and may not mimic
true scammer behavior.
- POS analysis is shallow: capturing semantic meaning or intent requires deeper
models (such as transformer-based analysis).

### Alternatives

- Token-level classification (NER, syntax trees)
- Transformer embeddings + clustering
- Stylometry or readability scoring
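
As a rough illustration of the stylometry option, the sketch below computes three simple features with the standard library: average sentence length in words, average word length, and type-token ratio. The regex-based sentence and word splitting is a deliberate simplification, and the function name and sample text are hypothetical:

```python
import re


def stylometric_features(text):
    """Compute a few simple style features for one posting."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }


# Invented example text, not taken from the datasets
sample = "We are hiring now. Apply today and earn money fast!"
print(stylometric_features(sample))
```

Features like these could be computed per post and compared across the three sources in the same way as the POS counts.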

---


## ✅ Conclusion

POS tagging points to measurable differences between fake and real listings in
this sample. Real job descriptions reflect genuine organizational language,
while AI-generated fakes appear structured and persuasive.

This is consistent with our hypothesis that **AI-generated scams differ
linguistically**, which makes POS tagging a useful early-stage tool for
detection.

> Fraud detection is now a battle between AI-generated content and AI-powered
detection — and POS tagging provides one lens into that ongoing war.

---
