Supreet37/AI-Text-Detector

🔍 AI vs Human Text Detector

Python 3.13+ scikit-learn License

A Production-Ready Machine Learning System to Detect AI-Generated Content with 87% Accuracy


📖 Table of Contents

  1. Problem Statement
  2. Why This Matters
  3. Dataset Overview
  4. Model Selection & Comparison
  5. Why Logistic Regression?
  6. Technical Architecture
  7. Feature Engineering
  8. Model Performance
  9. Installation & Usage
  10. Live Demo
  11. Future Improvements

🎯 Problem Statement

Can we reliably distinguish between text written by humans and text generated by AI models like GPT-4, Claude, or Gemini?

With the rapid advancement of Large Language Models (LLMs), AI-generated content has become increasingly sophisticated and harder to detect. This poses challenges for:

  • Academic Integrity - Detecting AI-written essays and assignments
  • Content Moderation - Identifying AI-generated spam and fake news
  • Trust & Authenticity - Verifying human-generated content online

This project builds a traditional Machine Learning classifier (not Deep Learning) to solve this binary classification problem using TF-IDF features and a Logistic Regression model.


⚠️ Why This Matters

| Domain | Challenge | Our Solution |
| --- | --- | --- |
| Education | Students using AI to write essays | Detect AI-generated submissions |
| Journalism | AI-written news articles | Verify authentic reporting |
| SEO Content | AI-generated blog spam | Filter low-quality content |
| Social Media | Bot-generated comments | Identify automated accounts |

📊 Dataset Overview

Dataset 1: AI vs Human Content (20,000 rows, 13 columns)

| Column | Description | Type |
| --- | --- | --- |
| content | The actual text to analyze | Text |
| label | Target variable ('human' or 'ai') | Category |
| source | Content source (news, blog, academic, etc.) | Category |
| topic | Subject matter | Category |
| word_count | Number of words | Numeric |
| char_count | Number of characters | Numeric |
| complexity_score | Readability score (1-10) | Numeric |
| ai_model | Which AI generated it (NaN for human) | Category |

Dataset 2: Sentence Dataset (9.8 million rows)

  • Extracted sentences from essays
  • Used for final model training
  • Provides 500K sampled sentences for balanced training

Key Insight from EDA

  • Correlation between numeric features and label:
      • word_count: 0.05 (very weak)
      • char_count: 0.07 (very weak)
      • complexity_score: 0.001 (almost zero)
  • Conclusion: the numeric features show almost no linear relationship with the label, so they add little to prediction. The model relies on the text content alone.

🧠 Model Selection & Comparison

I tested 4 different machine learning models on 437,000 training samples to find the best performer:

| Model | Accuracy | Training Time | F1-Score | Precision | Recall |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 87.12% | 8.5 sec | 0.87 | 0.87 | 0.87 |
| RidgeClassifier | 87.12% | 3.5 sec | 0.87 | 0.87 | 0.87 |
| MultinomialNB | 84.75% | 0.2 sec | 0.85 | 0.84 | 0.85 |
| SGDClassifier | 83.96% | 1.4 sec | 0.84 | 0.84 | 0.84 |
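A comparison of this kind could be set up roughly as follows. This is a minimal sketch with scikit-learn: the toy corpus, the 0/1 label encoding, the split ratio, and the default hyperparameters are all assumptions, so the numbers it prints say nothing about the results in the table.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the real comparison used about 437,000 samples.
human = [
    "honestly i missed the bus again and got soaked walking home",
    "my grandmother's soup recipe never turns out the same twice",
    "we argued about the movie ending the whole way back",
    "the dog chewed my notes so the essay is a bit late",
    "i can't believe how loud the neighbours were last night",
    "that exam was brutal but at least it's over now",
    "we got lost twice before finding the tiny cafe",
    "my brother still owes me money from the road trip",
]
ai = [
    "in conclusion, it is important to consider multiple perspectives",
    "furthermore, this approach offers several significant advantages",
    "overall, the results highlight the importance of careful planning",
    "additionally, it is essential to evaluate the potential impact",
    "in summary, these factors play a crucial role in the outcome",
    "moreover, this highlights the importance of effective communication",
    "it is important to note that several factors contribute to success",
    "in conclusion, a balanced approach offers significant benefits",
]
texts = human + ai
labels = [0] * len(human) + [1] * len(ai)  # assumed encoding: 1 = 'ai'

train_text, test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "RidgeClassifier": RidgeClassifier(),
    "MultinomialNB": MultinomialNB(),
    "SGDClassifier": SGDClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[name] = (accuracy, elapsed)
    print(f"{name:<20} accuracy={accuracy:.2%}  train_time={elapsed:.3f}s")
```

All four models read the same sparse TF-IDF matrix, so differences in accuracy and training time come from the classifier alone.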

Model Details

| Model | Type | Pros | Cons |
| --- | --- | --- | --- |
| Logistic Regression | Linear | Fast, interpretable, works well with sparse data | May underfit complex patterns |
| RidgeClassifier | Linear | Similar to LR, slightly faster | No probability outputs |
| MultinomialNB | Naive Bayes | Very fast, good for text | Assumes feature independence |
| SGDClassifier | Linear | Handles large datasets well | Requires careful tuning |

✅ Why Logistic Regression?

After extensive testing, Logistic Regression was chosen as the final model for several reasons:

1. Performance

  • Accuracy: 90.50%
  • Precision: 87% (when the model says AI, it is right 87% of the time)
  • Recall: 74% for AI, 94% for Human
  • F1-Score: 0.90
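To make these metrics concrete, here is a small sketch that derives them from confusion-matrix counts, treating 'ai' as the positive class. The counts are hypothetical round numbers chosen for readability, not the project's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp: AI texts predicted as AI        fp: human texts predicted as AI
    fn: AI texts predicted as human     tn: human texts predicted as human
    """
    precision = tp / (tp + fp)   # of everything flagged as AI, how much was AI
    recall = tp / (tp + fn)      # of all AI texts, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=160)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

Precision and recall can differ a lot for the same class, which is why the figures above report recall separately for AI and Human.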

2. Interpretability

Unlike models that are harder to inspect (neural networks, large Random Forests), Logistic Regression provides:

  • Feature importance through its coefficients (which words matter most, and in which direction)
  • Clear decision boundary
  • Easy to debug and explain

3. Speed

  • Training time: 8.5 seconds on 437,000 samples
  • Inference time: < 1 second per text
  • Suitable for real-time applications

4. Sparse Data Handling

TF-IDF creates sparse matrices (95% zeros). Logistic Regression handles sparse data efficiently.

5. Probability Outputs

Provides confidence scores (0-100%) for each prediction.
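For a binary Logistic Regression model, the confidence score comes from passing the linear score through the logistic (sigmoid) function. The sketch below shows that mapping in plain Python. The function names and the 0.5 decision threshold are illustrative assumptions, not taken from the project's code.

```python
import math


def sigmoid(score):
    """Map a raw linear score (w·x + b) to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing on large negative scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def confidence_report(score, threshold=0.5):
    """Return (label, confidence in percent) for one linear score."""
    p_ai = sigmoid(score)
    label = "ai" if p_ai >= threshold else "human"
    confidence = max(p_ai, 1.0 - p_ai) * 100
    return label, confidence


for raw_score in (-3.0, 0.0, 2.0):
    label, confidence = confidence_report(raw_score)
    print(f"score={raw_score:+.1f} -> {label} ({confidence:.1f}% confident)")
```

A score of 0 sits exactly on the decision boundary, so the confidence there is 50%.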

6. Resistance to Overfitting

L2 regularization penalizes large coefficients, which helps Logistic Regression generalize to unseen data.

7. Production Ready

  • Small model size (~5MB)
  • Easy to deploy
  • Works without GPU

🔧 Feature Engineering

TF-IDF Vectorization Parameters

| Parameter | Value | Why |
| --- | --- | --- |
| max_features | 10,000 | Keep only the 10,000 most frequent terms across the corpus |
| ngram_range | (1, 2) | Capture single words and word pairs (e.g., "machine learning") |
| stop_words | 'english' | Remove common words like 'the', 'and', 'is' |
| min_df | 2 | Ignore terms that appear in fewer than 2 documents |
| max_df | 0.95 | Ignore terms that appear in more than 95% of documents |
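These parameters map directly onto scikit-learn's `TfidfVectorizer`. The sketch below wires them into a pipeline with Logistic Regression. The pipeline layout, the toy corpus, the label encoding (1 = 'ai'), and `max_iter=1000` are assumptions for illustration. The project's actual training code may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(
        max_features=10_000,   # cap the vocabulary size
        ngram_range=(1, 2),    # unigrams and bigrams
        stop_words="english",  # drop common English words
        min_df=2,              # term must occur in at least 2 documents
        max_df=0.95,           # drop terms found in more than 95% of documents
    ),
    LogisticRegression(max_iter=1000),
)

# Hypothetical toy corpus; every kept term must appear in at least 2 documents.
texts = [
    "furthermore the results highlight important factors",
    "furthermore the analysis shows important results",
    "lol my cat knocked the coffee over again",
    "my cat slept on the keyboard again lol",
]
labels = [1, 1, 0, 0]  # assumed encoding: 1 = 'ai', 0 = 'human'

pipeline.fit(texts, labels)
vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
print("Kept terms:", sorted(vocabulary))

sample = ["furthermore these important results"]
print("Prediction:", pipeline.predict(sample)[0])
print("P(human), P(ai):", pipeline.predict_proba(sample)[0])
```

Keeping the vectorizer and the classifier in one pipeline means the same preprocessing is applied at training and at prediction time.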

Why TF-IDF?

| Method | What it does | Why used |
| --- | --- | --- |
| Term Frequency (TF) | How often a word appears | Captures word importance in document |
| Inverse Document Frequency (IDF) | How rare a word is across all documents | Reduces weight of common words |
| TF-IDF Score | TF × IDF | High score = important word for this document |

Example TF-IDF in Action

  • Document 1: "The cat sat on the mat"

  • Document 2: "The dog sat on the log"

  • Word 'cat' appears only in Doc1 → High IDF → High importance

  • Word 'the' appears in both docs → Low IDF → Low importance
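The two-document example can be worked through in a few lines of plain Python. This sketch uses the textbook definition IDF = ln(N / df), where N is the number of documents and df is the number of documents containing the term. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its exact values differ, but the ordering of the two words is the same.

```python
import math
from collections import Counter

docs = ["The cat sat on the mat", "The dog sat on the log"]
tokenized = [doc.lower().split() for doc in docs]


def idf(term, tokenized_docs):
    """Textbook inverse document frequency: ln(N / df)."""
    df = sum(term in tokens for tokens in tokenized_docs)
    if df == 0:
        return 0.0  # term never seen; treat as carrying no weight
    return math.log(len(tokenized_docs) / df)


def tfidf(term, doc_tokens, tokenized_docs):
    """Raw term count in the document times the term's IDF."""
    tf = Counter(doc_tokens)[term]
    return tf * idf(term, tokenized_docs)


for word in ("cat", "the"):
    score = tfidf(word, tokenized[0], tokenized)
    print(f"TF-IDF of {word!r} in Doc1: {score:.3f}")
# TF-IDF of 'cat' in Doc1: 0.693
# TF-IDF of 'the' in Doc1: 0.000
```

Even though 'the' occurs twice in Doc1, its IDF is ln(2 / 2) = 0, so its score is zero, while 'cat' scores ln(2 / 1) ≈ 0.693.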

🚀 Installation & Usage

Prerequisites

  • Python 3.13+
  • pip package manager

```bash
# 1. Clone the repository
git clone https://github.com/Nerdy37/AI-Text-Detector.git
cd AI-Text-Detector

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the application
python app.py

# 4. Open browser to http://localhost:7860
```

Live Demo

[Screenshot: veritext]

About

AI vs Human Text Detection using Machine Learning
