🔍 AI vs Human Text Detector

A Production-Ready Machine Learning System to Detect AI-Generated Content with 87% Accuracy

📖 Table of Contents

Problem Statement
Why This Matters
Dataset Overview
Model Selection & Comparison
Why Logistic Regression?
Technical Architecture
Feature Engineering
Model Performance
Installation & Usage
Live Demo
Future Improvements

🎯 Problem Statement

Can we reliably distinguish between text written by humans and text generated by AI models like GPT-4, Claude, or Gemini?

With the rapid advancement of Large Language Models (LLMs), AI-generated content has become increasingly sophisticated and harder to detect. This poses challenges for:

- Academic Integrity: detecting AI-written essays and assignments
- Content Moderation: identifying AI-generated spam and fake news
- Trust & Authenticity: verifying human-generated content online

This project treats the question as a binary classification problem and solves it with a traditional machine learning classifier (not deep learning): TF-IDF features fed into a Logistic Regression model.

⚠️ Why This Matters

| Domain | Challenge | Our Solution |
|---|---|---|
| Education | Students using AI to write essays | Detect AI-generated submissions |
| Journalism | AI-written news articles | Verify authentic reporting |
| SEO Content | AI-generated blog spam | Filter low-quality content |
| Social Media | Bot-generated comments | Identify automated accounts |

📊 Dataset Overview

Dataset 1: AI vs Human Content (20,000 rows, 13 columns)

| Column | Description | Type |
|---|---|---|
| `content` | The actual text to analyze | Text |
| `label` | Target variable ('human' or 'ai') | Category |
| `source` | Content source (news, blog, academic, etc.) | Category |
| `topic` | Subject matter | Category |
| `word_count` | Number of words | Numeric |
| `char_count` | Number of characters | Numeric |
| `complexity_score` | Readability score (1-10) | Numeric |
| `ai_model` | Which AI generated it (NaN for human) | Category |

Dataset 2: Sentence Dataset (9.8 million rows)

- Sentences extracted from essays
- Used for final model training
- 500K sentences are sampled from it to build a balanced training set

Key Insight from EDA

Correlation between numeric features and label:

- word_count: 0.05 (very weak)
- char_count: 0.07 (very weak)
- complexity_score: 0.001 (almost zero)

Conclusion: the numeric features carry almost no linear signal about the label, so the model relies on the text content alone.

🧠 Model Selection & Comparison

I tested 4 different machine learning models on 437,000 training samples to find the best performer:

| Model | Accuracy | Training Time | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| Logistic Regression | 87.12% | 8.5 sec | 0.87 | 0.87 | 0.87 |
| RidgeClassifier | 87.12% | 3.5 sec | 0.87 | 0.87 | 0.87 |
| MultinomialNB | 84.75% | 0.2 sec | 0.85 | 0.84 | 0.85 |
| SGDClassifier | 83.96% | 1.4 sec | 0.84 | 0.84 | 0.84 |
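
A comparison of this kind can be run by fitting each candidate on the same TF-IDF features and recording accuracy, F1, and fit time. The following is a sketch under assumptions: the hyperparameters (`max_iter=1000`, `random_state=42`), the weighted F1 average, and the tiny corpus are all illustrative choices, not the settings behind the numbers in the table.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB


def compare_models(X_train, y_train, X_test, y_test):
    """Fit each candidate on the same features; record accuracy, F1, fit time."""
    candidates = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "RidgeClassifier": RidgeClassifier(),
        "MultinomialNB": MultinomialNB(),
        "SGDClassifier": SGDClassifier(random_state=42),
    }
    results = {}
    for name, model in candidates.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        predictions = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, predictions),
            "f1": f1_score(y_test, predictions, average="weighted", zero_division=0),
            "train_seconds": train_seconds,
        }
    return results


# Tiny made-up corpus, only here to make the sketch runnable.
train_texts = [
    "honestly i just scribbled this essay the night before",
    "my dog ate the first draft so this one is rushed",
    "we argued about the ending for hours at lunch",
    "in conclusion, it is important to note the key factors",
    "overall, this highlights the importance of the key factors",
    "it is important to note that these factors play a crucial role",
]
train_labels = ["human", "human", "human", "ai", "ai", "ai"]
test_texts = [
    "it is important to note the crucial role of these factors",
    "i rushed this draft at lunch",
]
test_labels = ["ai", "human"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

results = compare_models(X_train, train_labels, X_test, test_labels)
for name, metrics in results.items():
    print(
        f"{name}: accuracy={metrics['accuracy']:.2f}, "
        f"f1={metrics['f1']:.2f}, fit={metrics['train_seconds']:.3f}s"
    )
```

Timings on a toy corpus say nothing about the 437,000-sample run; the point is only that every model sees identical features and the same split.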

Model Details

| Model | Type | Pros | Cons |
|---|---|---|---|
| Logistic Regression | Linear | Fast, interpretable, works well with sparse data | May underfit complex patterns |
| RidgeClassifier | Linear | Similar to LR, slightly faster | No probability outputs |
| MultinomialNB | Naive Bayes | Very fast, good for text | Assumes feature independence |
| SGDClassifier | Linear | Handles large datasets well | Requires careful tuning |

✅ Why Logistic Regression?

After extensive testing, Logistic Regression was chosen as the final model for several reasons:

1. Performance

- Accuracy: 90.50%
- Precision: 87% (when the model says AI, it's right 87% of the time)
- Recall: 74% for AI, 94% for Human
- F1-Score: 0.90
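
For reference, precision, recall, and F1 for one class follow directly from its confusion-matrix counts. The sketch below uses the standard definitions; `precision_recall_f1` is a hypothetical helper and the counts are invented for illustration, not the project's evaluation results.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard per-class metrics computed from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts for the "AI" class (made up for this example).
precision, recall, f1 = precision_recall_f1(true_pos=60, false_pos=20, false_neg=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "when the model says AI, how often is it right?", while recall answers "of all AI texts, how many did it catch?".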

2. Interpretability

Unlike black-box models (Neural Networks, Random Forests), Logistic Regression provides:

- Feature importance (which words matter most)
- Clear decision boundary
- Easy to debug and explain
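
Feature importance for a linear model comes down to ranking terms by their learned weight. The sketch below assumes the term names and weights have already been extracted; with scikit-learn they would typically come from `vectorizer.get_feature_names_out()` and `model.coef_[0]`, where positive weights push toward `model.classes_[1]`. `top_terms` is a hypothetical helper and the weights shown are invented.

```python
def top_terms(feature_names, coefficients, k=3):
    """Return the k most positive and k most negative (term, weight) pairs."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[-k:][::-1]  # largest weight first
    return most_positive, most_negative


# Hypothetical terms and weights, for illustration only.
names = ["crucial role", "honestly", "important note", "lol", "overall"]
weights = [-1.4, 2.1, -1.9, 1.7, -0.6]

positive, negative = top_terms(names, weights, k=2)
print("pushes toward classes_[1]:", positive)
print("pushes toward classes_[0]:", negative)
```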

3. Speed

- Training time: 8.5 seconds on 437,000 samples
- Inference time: < 1 second per text
- Suitable for real-time applications

4. Sparse Data Handling

TF-IDF creates sparse matrices (95% zeros). Logistic Regression handles sparse data efficiently.

5. Probability Outputs

Provides confidence scores (0-100%) for each prediction.
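
These scores come from the logistic (sigmoid) function applied to a weighted sum of the TF-IDF features. The sketch below shows that calculation in plain Python; the weights, bias, and feature values are made up, and `predict_probability` is a hypothetical helper rather than the project's inference code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one document.

    `features` maps term -> TF-IDF value; terms without a weight are ignored.
    """
    z = bias + sum(weights[term] * value
                   for term, value in features.items() if term in weights)
    return sigmoid(z)


# Hypothetical weights and features, for illustration only.
weights = {"important note": 1.8, "honestly": -2.0}
features = {"important note": 0.6, "unseen term": 0.3}

probability = predict_probability(features, weights, bias=-0.2)
print(f"confidence: {probability:.0%}")
```

A score near 50% means the weighted evidence is close to balanced, which is worth surfacing to users instead of a bare label.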

6. Reduced Overfitting

L2 regularization penalizes large coefficients, which limits overfitting and helps Logistic Regression generalize to unseen data.
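
For reference, the L2-penalized objective can be written as follows. This is the form commonly given for scikit-learn's `LogisticRegression`, with labels encoded as $y_i \in \{-1, +1\}$; whether the project changed the default `C` is not stated, so treat the symbols as generic.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert_2^2
  \;+\; C \sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i\,(w^{\top} x_i + b)\bigr)\Bigr)
```

The first term shrinks the weights and the second measures fit to the training data; a smaller $C$ means stronger regularization.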

7. Production Ready

- Small model size (~5MB)
- Easy to deploy
- Works without GPU

🔧 Feature Engineering

TF-IDF Vectorization Parameters

| Parameter | Value | Why |
|---|---|---|
| `max_features` | 10,000 | Keep only the 10,000 most frequent terms across the corpus |
| `ngram_range` | (1, 2) | Capture single words and word pairs (e.g., "machine learning") |
| `stop_words` | 'english' | Remove common words like 'the', 'and', 'is' |
| `min_df` | 2 | Ignore terms that appear in fewer than 2 documents |
| `max_df` | 0.95 | Ignore terms that appear in more than 95% of documents |
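
With scikit-learn, the parameters in the table map directly onto `TfidfVectorizer` arguments. The sketch below is illustrative: the four-sentence corpus is made up so that some terms survive the `min_df=2` filter, and the sparsity figure it prints describes only this toy matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as the table above.
vectorizer = TfidfVectorizer(
    max_features=10_000,   # cap on vocabulary size
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # drop common English words
    min_df=2,              # term must occur in at least 2 documents
    max_df=0.95,           # term must occur in at most 95% of documents
)

# Toy corpus (made up); each kept term appears in exactly two documents.
corpus = [
    "machine learning models generate fluent text",
    "machine learning models detect generated text",
    "students write essays by hand",
    "students write essays about history",
]

X = vectorizer.fit_transform(corpus)  # SciPy sparse matrix, one row per document
n_rows, n_cols = X.shape
sparsity = 1.0 - X.nnz / (n_rows * n_cols)

print(f"matrix shape: {n_rows} x {n_cols}")
print(f"share of zero entries: {sparsity:.0%}")
print("sample terms:", sorted(vectorizer.vocabulary_)[:5])
```

Stop words are removed before n-grams are built, so bigrams only ever join words that survive the stop-word filter.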

Why TF-IDF?

| Method | What it does | Why used |
|---|---|---|
| Term Frequency (TF) | How often a word appears | Captures word importance in document |
| Inverse Document Frequency (IDF) | How rare a word is across all documents | Reduces weight of common words |
| TF-IDF Score | TF × IDF | High score = important word for this document |

Example TF-IDF in Action

Document 1: "The cat sat on the mat"
Document 2: "The dog sat on the log"
Word 'cat' appears only in Doc1 → High IDF → High importance
Word 'the' appears in both docs → Low IDF → Low importance
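
The two-document example can be worked through by hand. The sketch below uses the textbook formula, IDF = ln(N / df), purely for illustration; scikit-learn's `TfidfVectorizer` applies a smoothed variant and normalizes each row by default, so its numbers differ. The helper names are hypothetical.

```python
import math

docs = ["The cat sat on the mat", "The dog sat on the log"]
tokenized = [doc.lower().split() for doc in docs]


def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Textbook inverse document frequency: ln(N / df)."""
    doc_freq = sum(term in tokens for tokens in corpus)
    if doc_freq == 0:
        return 0.0  # term never occurs; nothing to weight
    return math.log(len(corpus) / doc_freq)


def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


for term in ("cat", "the"):
    score = tfidf(term, tokenized[0], tokenized)
    print(f"{term!r} in Doc1: tf-idf = {score:.4f}")
```

'cat' occurs in one of two documents, so its IDF is ln(2) and its score is positive; 'the' occurs in both, so its IDF is ln(1) = 0 and its score vanishes no matter how often it appears.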

🚀 Installation & Usage

Prerequisites

- Python 3.13+
- pip package manager

```bash
# 1. Clone the repository
git clone https://github.com/Nerdy37/AI-Text-Detector.git
cd AI-Text-Detector

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the application
python app.py

# 4. Open your browser at http://localhost:7860
```
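
For scripted use outside the web app, the saved artifacts could be loaded directly. The sketch below rests on assumptions: it presumes that `vectorizer.pkl` and `sentence_model.pkl` from the repository are a pickled scikit-learn vectorizer and a classifier exposing `predict_proba` and `classes_`, which the README does not confirm. `load_pickle` and `classify` are hypothetical helpers, not functions from `app.py`.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def classify(text, vectorizer, model):
    """Return (label, confidence) for a single text.

    Assumes `vectorizer.transform` accepts a list of strings and `model`
    provides `predict_proba` plus a `classes_` attribute.
    """
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    scores = dict(zip(model.classes_, probabilities))
    label = max(scores, key=scores.get)
    return label, scores[label]


# Runs only when both artifact files are present in the working directory.
if Path("vectorizer.pkl").exists() and Path("sentence_model.pkl").exists():
    vectorizer = load_pickle("vectorizer.pkl")
    model = load_pickle("sentence_model.pkl")
    label, confidence = classify("Paste a sentence to check here.", vectorizer, model)
    print(f"{label} ({confidence:.0%} confidence)")
```

The pickles should be loaded with the same scikit-learn version used to create them; the pinned versions in `requirements.txt` are the place to check.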

Live Demo

![veritext]
