# 🚀 EXECUTION INSTRUCTIONS

**⚠️ IMPORTANT: Run cells in this exact order to avoid errors:**

1. **Import Libraries** (Cell 2) - Run first
2. **Helper Functions** (Cell 3) - Run second  
3. **Load Dataset** (Cell 5) - Run third
4. **Preprocess Text** (Cell 7) - Run fourth
5. **Feature Extraction** (Cell 9) - Run fifth
6. **Choose your training approach:**
   - **Enhanced Logistic Regression** (Cell 10) - **RECOMMENDED** for Streamlit compatibility
   - **Standard Logistic Regression** (Cell 11) - Alternative approach
   - **BiLSTM Model** (Cell 13) - Deep learning approach
   - **BERT Model** (Cell 16) - Transformer approach

**✅ For Streamlit app compatibility, make sure to run Cell 10 (Enhanced Logistic Regression)**

---

# AI Text Detector

This notebook demonstrates how to detect whether text is AI-generated or human-written using various machine learning and deep learning models. We will:
- Load and preprocess a labeled dataset
- Extract features (TF-IDF, learned word embeddings, BERT tokenization)
- Train and evaluate Logistic Regression, SVM, SGD, BiLSTM, and a BERT-tokenizer-based neural model
- Compare their performance using accuracy, precision, recall, and F1 score
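
As a reference for the four metrics named above, here is a minimal sketch that derives them from the confusion counts of a binary classifier. The function name `binary_metrics` and the toy labels are illustrative assumptions; the notebook itself relies on scikit-learn's metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # AI text flagged as AI
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # human text flagged as AI
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # AI text missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # human text passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (0 = Human, 1 = AI), made up for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision penalizes human text wrongly flagged as AI, while recall penalizes AI text that slips through, so the two are worth reading separately rather than relying on accuracy alone.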

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import torch

# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mohgu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mohgu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mohgu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Define helper functions
def show_metrics(name, y_true, y_pred):
    """Display evaluation metrics for a model"""
    print(f'--- {name} ---')
    print('Accuracy:', accuracy_score(y_true, y_pred))
    print('Precision:', precision_score(y_true, y_pred))
    print('Recall:', recall_score(y_true, y_pred))
    print('F1 Score:', f1_score(y_true, y_pred))
    print()

def simple_preprocess(text):
    """Simple text preprocessing for quick training"""
    return str(text).lower().strip()

## Load Labeled Dataset

Upload your CSV file with columns `text` and `generated` (0 = Human, 1 = AI). The cells below read the label from the `generated` column.
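
The later cells read the columns `text` and `generated`, so it can help to confirm the header row before loading a large file. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample data are assumptions for illustration, not part of the notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("text", "generated")  # column names the later cells read


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory stand-ins for real files (illustrative data only).
good = io.StringIO("text,generated\nHello world,0\nSome essay,1\n")
bad = io.StringIO("text,label\nHello world,0\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['generated']
```

With a real file, the same function can be called on `open('AI_Human.csv', newline='', encoding='utf-8')`, assuming the file is UTF-8 encoded.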

In [3]:
# Load the dataset
import os

# Try different paths for the CSV file
csv_paths = [
    'AI_Human.csv',
    'AI_Human.csv/AI_Human.csv',  # In case it's in a subfolder
    os.path.join(os.getcwd(), 'AI_Human.csv')
]

df = None
for path in csv_paths:
    try:
        if os.path.exists(path):
            df = pd.read_csv(path)
            print(f"Successfully loaded dataset from: {path}")
            break
    except PermissionError:
        print(f"Permission denied for: {path}")
        continue
    except Exception as e:
        print(f"Error loading {path}: {e}")
        continue

if df is None:
    print("Could not load dataset. Please check:")
    print("1. Close Excel or any program that has the CSV file open")
    print("2. Make sure the file exists in the current directory")
    print("3. Check file permissions")
    print(f"Current directory: {os.getcwd()}")
    print(f"Files in directory: {os.listdir('.')}")
else:
    print(f"Dataset shape: {df.shape}")
    print(df.head())  # A bare expression inside an if/else block is not displayed

Permission denied for: AI_Human.csv
Successfully loaded dataset from: AI_Human.csv/AI_Human.csv
Dataset shape: (487235, 2)


## Preprocess Text

Clean, tokenize, lemmatize, and remove stopwords from the text data.
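
To make the effect of these steps concrete, here is a small sketch of the cleaning and stopword-removal stages on one sentence. It assumes a tiny hand-written stopword set instead of NLTK's English list and skips lemmatization, so it runs without any downloads; the sample sentence is made up.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "to", "me", "or"}


def clean_text(text):
    """Lowercase, drop URLs and e-mail addresses, keep only letters and spaces."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"\S+@\S+", "", text)          # e-mail addresses
    text = re.sub(r"[^a-z\s]", "", text)         # digits and punctuation
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)


sample = "Visit https://example.com or e-mail me at bob@example.com: the 3 Cars ARE fast!"
cleaned = clean_text(sample)
print(cleaned)                    # visit or email me at the cars are fast
print(remove_stopwords(cleaned))  # visit email cars fast
```

Note that stripping non-letters also joins hyphenated words ("e-mail" becomes "email") and removes all digits, which may or may not be desirable for a given dataset.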

In [4]:
# Ensure NLTK 'punkt' and 'punkt_tab' resources are downloaded
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# Preprocessing functions
def clean_text(text):
    text = str(text).lower()  # str() guards against missing (NaN) entries
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    text = clean_text(text)
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

# Check if 'generated' column exists before proceeding
if 'generated' in df.columns:
    df['text_clean'] = df['text'].apply(preprocess_text)
    print(df[['text', 'text_clean', 'generated']].head())
else:
    print("Error: 'generated' column is missing from the dataset. Please check the dataset structure.")
    print("Available columns:", df.columns)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mohgu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mohgu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                                                text  \
0  Cars. Cars have been around since they became ...   
1  Transportation is a large necessity in most co...   
2  "America's love affair with it's vehicles seem...   
3  How often do you ride in a car? Do you drive a...   
4  Cars are a wonderful thing. They are perhaps o...   

                                          text_clean  generated  
0  car car around since became famous henry ford ...        0.0  
1  transportation large necessity country worldwi...        0.0  
2  america love affair vehicle seems cooling say ...        0.0  
3  often ride car drive one motor vehicle work st...        0.0  
4  car wonderful thing perhaps one world greatest...        0.0  


## Feature Extraction and Train-Test Split

We will extract features using TF-IDF and split the data into training and test sets.
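
The sketch below illustrates what a TF-IDF transform computes and why the weights are learned from the training documents only. It is a simplified, pure-Python stand-in: it assumes whitespace tokenization, raw term counts, the smoothed IDF form `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which correspond to scikit-learn's documented defaults but omit its token pattern and `max_features` handling. The helper names and toy documents are illustrative.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to an L2-normalized TF-IDF dict; unseen terms are dropped."""
    counts = Counter(w for w in doc.split() if w in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {term: v / norm for term, v in weights.items()} if norm else {}


train_docs = ["car drive road", "car engine", "essay about road safety"]
test_doc = "car road spaceship"  # "spaceship" never appears in training

idf = fit_idf(train_docs)
vector = transform(test_doc, idf)
print(sorted(vector))  # ['car', 'road']
```

Fitting on the training split and only transforming the test split keeps test-set vocabulary and document frequencies from leaking into the features.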

In [5]:
# Ensure required columns exist
if 'text_clean' in df.columns and 'generated' in df.columns:
    # Extract features using TF-IDF and split the data
    X = df['text_clean']
    y = df['generated']

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # TF-IDF Vectorization
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    print('TF-IDF feature shape:', X_train_tfidf.shape)
else:
    print("Error: Required columns 'text_clean' and/or 'generated' are missing from the DataFrame.")
    print("Available columns:", df.columns)

TF-IDF feature shape: (389788, 5000)


In [None]:
# Enhanced Logistic Regression Training (recommended for the Streamlit app)
# Trains on lightly preprocessed text with unigram and bigram TF-IDF features

import joblib  # Needed to save the vectorizer and model

# Prepare data using simple preprocessing
print("Preparing data for enhanced logistic regression...")
df_copy = df.copy()  # Work with a copy to avoid conflicts
df_copy['text_simple'] = df_copy['text'].apply(simple_preprocess)

# Split data
X_simple = df_copy['text_simple']
y_simple = df_copy['generated']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Create TF-IDF vectorizer (compatible with Streamlit app)
print("Creating enhanced TF-IDF vectorizer...")
tfidf_vectorizer_enhanced = TfidfVectorizer(
    max_features=5000, 
    ngram_range=(1, 2), 
    stop_words='english'
)
X_train_tfidf_enhanced = tfidf_vectorizer_enhanced.fit_transform(X_train_simple)
X_test_tfidf_enhanced = tfidf_vectorizer_enhanced.transform(X_test_simple)

# Train Enhanced Logistic Regression
print("Training enhanced logistic regression...")
lr_enhanced = LogisticRegression(max_iter=200, random_state=42)
lr_enhanced.fit(X_train_tfidf_enhanced, y_train_simple)

# Evaluate
y_pred_enhanced = lr_enhanced.predict(X_test_tfidf_enhanced)
show_metrics('Enhanced Logistic Regression (TF-IDF)', y_test_simple, y_pred_enhanced)

# Save models (these will be compatible with Streamlit)
joblib.dump(tfidf_vectorizer_enhanced, "tfidf_vectorizer.pkl")
joblib.dump(lr_enhanced, "logistic_regression_model.pkl")
print("Enhanced TF-IDF vectorizer saved as 'tfidf_vectorizer.pkl'")
print("Enhanced Logistic Regression model saved as 'logistic_regression_model.pkl'")

# Save predictions
lr_enhanced_output_df = pd.DataFrame({
    'True Label': y_test_simple,
    'Predicted Label': y_pred_enhanced
})
lr_enhanced_output_df.to_csv('logistic_regression_predictions.csv', index=False)
print("Enhanced Logistic Regression predictions saved to 'logistic_regression_predictions.csv'.")

Preparing data for enhanced logistic regression...
Creating enhanced TF-IDF vectorizer...
Training enhanced logistic regression...
--- Enhanced Logistic Regression (TF-IDF) ---
Accuracy: 0.993268135499297
Precision: 0.9951977793199167
Recall: 0.9867070317875327
F1 Score: 0.9909342177998894





In [None]:
# Train Standard Logistic Regression (using preprocessed data)
import joblib  # Imported here so this cell also works when Cell 10 was skipped

print("Training standard logistic regression...")

# Check if we have the required variables from previous cells
if 'X_train_tfidf' in locals() and 'y_train' in locals():
    lr = LogisticRegression(max_iter=200)
    lr.fit(X_train_tfidf, y_train)
    y_pred_lr = lr.predict(X_test_tfidf)
    
    # Evaluate Logistic Regression
    show_metrics('Standard Logistic Regression (TF-IDF)', y_test, y_pred_lr)
    
    # Save the standard model (backup)
    joblib.dump(lr, "logistic_regression_standard.pkl")
    print("Standard Logistic Regression model saved as 'logistic_regression_standard.pkl'.")
    
    # Save predictions
    lr_output_df = pd.DataFrame({
        'True Label': y_test,
        'Predicted Label': y_pred_lr
    })
    lr_output_df.to_csv('logistic_regression_standard_predictions.csv', index=False)
    print("Standard Logistic Regression predictions saved.")
else:
    print("Warning: Required variables not found. Please run the preprocessing cells first.")
    print("Available variables:", [var for var in locals().keys() if not var.startswith('_')])

## Train and Evaluate BiLSTM Model

We will use a Bidirectional LSTM (BiLSTM) neural network with word embeddings to classify the text.
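
Before text reaches the embedding layer, each document has to become a fixed-length list of integer word ids. The following pure-Python sketch mimics that step under stated assumptions: ids are assigned by descending word frequency starting at 1, words with an id at or above `max_words` are dropped, and sequences are left-padded with zeros or truncated from the front, which matches the default behavior of the Keras `Tokenizer` and `pad_sequences` as I understand it. The helper names and toy texts are illustrative.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; id 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, max_words):
    """Convert text to ids, dropping unknown words and ids >= max_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < max_words]


def pad(seq, max_len):
    """Left-pad with zeros, or keep only the last max_len ids if too long."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq


train = ["car drive car road", "car road"]
word_index = build_word_index(train)
print(word_index)  # {'car': 1, 'road': 2, 'drive': 3}

print(pad(to_sequence("car road", word_index, max_words=3), max_len=4))
# [0, 0, 1, 2]
print(pad(to_sequence("car drive car road car", word_index, max_words=10), max_len=4))
# [3, 1, 2, 1]
```

With `max_len = 100`, as in the cell below, any document longer than 100 tokens contributes only its final 100 tokens to the model under these defaults.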

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Vocabulary size and fixed sequence length for the BiLSTM
max_words = 5000
max_len = 100

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

# Build BiLSTM model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=64, input_length=max_len),
    Bidirectional(LSTM(32)),
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train BiLSTM
model.fit(X_train_pad, y_train, epochs=2, batch_size=32, verbose=1)
y_pred_bilstm = (model.predict(X_test_pad) > 0.5).astype(int).flatten()

# Evaluate BiLSTM
show_metrics('BiLSTM', y_test, y_pred_bilstm)

# Save the BiLSTM model
model.save("bilstm_model.h5")
print("BiLSTM model saved as 'bilstm_model.h5'.")

# Save the tokenizer for BiLSTM
import joblib
joblib.dump(tokenizer, "bilstm_tokenizer.pkl")
print("BiLSTM tokenizer saved as 'bilstm_tokenizer.pkl'.")

In [None]:
# Save predictions and true labels to CSV files for all models
import pandas as pd

# Ensure predictions and true labels are available
if 'y_test' in locals():
    # Save BiLSTM predictions
    if 'y_pred_bilstm' in locals():
        bilstm_output_df = pd.DataFrame({
            'True Label': y_test,
            'Predicted Label': y_pred_bilstm
        })
        bilstm_output_df.to_csv('bilstm_predictions.csv', index=False)
        print("BiLSTM predictions saved to 'bilstm_predictions.csv'.")

    # Save standard Logistic Regression predictions
    # (separate file so the enhanced model's predictions from Cell 10 are not overwritten)
    if 'y_pred_lr' in locals():
        lr_output_df = pd.DataFrame({
            'True Label': y_test,
            'Predicted Label': y_pred_lr
        })
        lr_output_df.to_csv('logistic_regression_standard_predictions.csv', index=False)
        print("Standard Logistic Regression predictions saved to 'logistic_regression_standard_predictions.csv'.")

    # Save SVM predictions
    if 'y_pred_svm' in locals():
        svm_output_df = pd.DataFrame({
            'True Label': y_test,
            'Predicted Label': y_pred_svm
        })
        svm_output_df.to_csv('svm_predictions.csv', index=False)
        print("SVM predictions saved to 'svm_predictions.csv'.")

    # Save BERT predictions
    if 'y_pred_bert' in locals():
        bert_output_df = pd.DataFrame({
            'True Label': y_test,
            'Predicted Label': y_pred_bert
        })
        bert_output_df.to_csv('bert_predictions.csv', index=False)
        print("BERT predictions saved to 'bert_predictions.csv'.")
else:
    print("Error: True labels are not available. Train and evaluate the models first.")

## Train and Evaluate BERT Model

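This section uses the pre-trained BERT tokenizer (via Hugging Face Transformers) to convert text into token ids, then trains a small Keras network on those ids: an embedding layer, global average pooling, and dense layers. It does not load or fine-tune the pre-trained BERT weights, so the embeddings are learned from scratch on this dataset.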
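
The code cell that follows feeds BERT token ids into an `Embedding` layer and averages the resulting vectors with `GlobalAveragePooling1D`. The sketch below shows that pooling step in plain Python with a made-up 2-dimensional embedding table. The masked variant is an assumption added for contrast: the `Embedding` layer in the notebook does not set `mask_zero`, so padding positions take part in its average.

```python
def mean_pool(token_ids, embeddings, pad_id=None):
    """Average the embedding vectors of a sequence, optionally skipping pad_id."""
    kept = [t for t in token_ids if t != pad_id]
    if not kept:
        # Nothing left to average: return a zero vector of the right size.
        return [0.0] * len(next(iter(embeddings.values())))
    dim = len(embeddings[kept[0]])
    return [sum(embeddings[t][d] for t in kept) / len(kept) for d in range(dim)]


# Hypothetical embedding table; id 0 stands for the padding token.
embeddings = {0: [0.0, 0.0], 1: [1.0, 3.0], 2: [3.0, 1.0]}
token_ids = [1, 2, 0, 0]  # two real tokens followed by two padding tokens

print(mean_pool(token_ids, embeddings))            # [1.0, 1.0]
print(mean_pool(token_ids, embeddings, pad_id=0))  # [2.0, 2.0]
```

Because every sequence is padded to 64 ids, short texts have their average pulled toward the padding embedding in the unmasked case, which is one reason this model should not be expected to behave like a fine-tuned BERT classifier.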

In [None]:
# Simple BERT-like approach using tokenization and neural network
from transformers import AutoTokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, GlobalAveragePooling1D
import tensorflow as tf
import numpy as np
from tqdm import tqdm

# Load BERT tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Encode texts using BERT tokenizer
def simple_bert_encode(texts, tokenizer, max_len=64):
    input_ids = []
    for text in tqdm(texts, desc="Encoding Texts"):
        encoded = tokenizer.encode(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
        )
        input_ids.append(encoded)
    return np.array(input_ids)

# Tokenize training and test data
X_train_bert = simple_bert_encode(X_train, bert_tokenizer)
X_test_bert = simple_bert_encode(X_test, bert_tokenizer)

print(f"Shape of X_train_bert: {X_train_bert.shape}")
print(f"Shape of X_test_bert: {X_test_bert.shape}")

# Build a simple neural network model
bert_model = Sequential([
    Embedding(input_dim=bert_tokenizer.vocab_size, output_dim=128, input_length=64),
    GlobalAveragePooling1D(),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

# Compile the model
bert_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=['accuracy']
)

# Train the model
print("Training Simple BERT-based Model...")
bert_model.fit(
    X_train_bert, y_train,
    epochs=3,
    batch_size=32,
    validation_split=0.1,
    verbose=1
)

# Predict using the model
y_pred_bert = (bert_model.predict(X_test_bert) > 0.5).astype(int).flatten()

# Evaluate the model
show_metrics('Simple BERT-based Model', y_test, y_pred_bert)

# Save the BERT model
bert_model.save("bert_model.h5")
print("BERT model saved as 'bert_model.h5'.")

# Save the BERT tokenizer
import joblib
joblib.dump(bert_tokenizer, "bert_tokenizer.pkl")
print("BERT tokenizer saved as 'bert_tokenizer.pkl'.")

In [None]:
# Debugging: Check shapes of tokenized inputs
if 'X_train_bert' in locals():
    print("Shape of X_train_bert:", X_train_bert.shape)
    print("Shape of X_test_bert:", X_test_bert.shape)
    print("Sample of X_train_bert[0]:", X_train_bert[0][:10])  # Show first 10 tokens
else:
    print("X_train_bert not defined yet. Run the BERT encoding cell first.")

# Debugging: Check if BERT tokenizer is working
try:
    from transformers import AutoTokenizer
    bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    print("BERT tokenizer loaded successfully.")
    print("Vocab size:", bert_tokenizer.vocab_size)
except Exception as e:
    print("Error loading BERT tokenizer:", e)

In [None]:
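# Train SVM
# Note: SVC is a kernel method whose fit time grows at least quadratically with
# the number of samples, so training on the full training set may take very long.
print("Training SVM...")
svm = SVC()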
svm.fit(X_train_tfidf, y_train)
y_pred_svm = svm.predict(X_test_tfidf)

# Evaluate SVM
show_metrics('SVM (TF-IDF)', y_test, y_pred_svm)

# Save the SVM model
import joblib
joblib.dump(svm, "svm_model.pkl")
print("SVM model saved as 'svm_model.pkl'.")

In [None]:
from sklearn.linear_model import SGDClassifier
from tqdm import tqdm

# Train SGDClassifier with progress tracking
print("Training SGDClassifier...")
sgd = SGDClassifier()
for epoch in tqdm(range(10), desc="SGD Training Progress"):
    sgd.partial_fit(X_train_tfidf, y_train, classes=np.unique(y_train))

y_pred_sgd = sgd.predict(X_test_tfidf)

# Evaluate SGDClassifier
show_metrics('SGDClassifier (TF-IDF)', y_test, y_pred_sgd)

In [None]:
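# Save the standard TF-IDF vectorizer for later use
import joblib

# Use a separate file name (chosen here): Cell 10 already writes the enhanced
# vectorizer to 'tfidf_vectorizer.pkl', and overwriting it would pair the saved
# 'logistic_regression_model.pkl' with a different vocabulary.
if 'vectorizer' in locals():
    joblib.dump(vectorizer, "tfidf_vectorizer_standard.pkl")
    print("Standard TF-IDF vectorizer saved as 'tfidf_vectorizer_standard.pkl'.")
else:
    print("Error: TF-IDF vectorizer is not defined. Train the vectorizer before saving.")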