# Detect AI-generated Content

### Definition
- **AI-generated content** is output (text, code, images, etc.) produced by generative AI, in particular Large Language Models (LLMs), in response to an input prompt. LLMs are trained on large datasets covering a wide range of writing styles, topics, and sources, which lets them generate human-like text. They can take context into account and mimic the writing style of specific individuals or sources, producing text that is remarkably coherent and contextually appropriate.


### Benefits 
- AI is your copilot / virtual assistant for productivity & creativity
- AI-generated text opens up a new door to content creation, customer service, and automated journalism.

### Risks
- With the rise of ChatGPT (and LLMs in general), AI-generated text has become increasingly hard to distinguish from human-written content. Without checks and balances, some people will use AI irresponsibly. This raises concerns about the potential **misuse of AI for spreading disinformation or generating fraudulent content**, so businesses need to take extra steps to ensure employees use AI responsibly and ethically.



In [1]:
%matplotlib inline

import matplotlib
#matplotlib.use( 'tkagg' )
import matplotlib.pyplot as plt
import seaborn as sns

import os, time
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken
import streamlit as st
from PIL import Image
import requests
#import config
#import pyttsx3
import speech_recognition as sr
print(sr.__version__)
# Use TTS to generate synthetic speech
import pyaudio
from TTS.api import TTS
import io
import IPython.display as ipd
from playsound import playsound
#Langchain
from langchain import OpenAI, PromptTemplate, LLMChain, ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# Import Azure OpenAI
from langchain.llms import AzureOpenAI

3.10.0




In [2]:
# Authentication & Authorization using API Key option

#Azure OpenAI Credential
os.environ["OPENAI_API_KEY"] = "<FILL IN YOUR API KEY>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<FILL IN YOUR END POINT>.openai.azure.com/"
API_KEY = os.getenv("OPENAI_API_KEY") 
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT") 
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
#openai.api_version = "2022-12-01"
#openai.api_version = "2023-03-15-preview"   
openai.api_version = "2023-07-01-preview"   

In [7]:
#Create a ChatCompletion Request
# Use LLM to generate response

#COMPLETIONS_MODEL = "text-davinci-003" #text generation 3.5 model name
#COMPLETIONS_MODEL = "gpt-3.5-turbo" #OpenAI Standalone chatgpt model name
#COMPLETIONS_MODEL = "gpt-35-turbo" #Azure OpenAI chatgpt model name 
COMPLETIONS_MODEL = "gpt-35-turbo-16k" #Azure OpenAI chatgpt model name 
#COMPLETIONS_MODEL = "gpt-4-32k" #"gpt-4" #Advanced GPT-4 model
EMBEDDING_MODEL = "text-embedding-ada-002"

engine = COMPLETIONS_MODEL

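def generate_response(prompt):
    """Send a single-turn prompt to the Azure OpenAI chat deployment and return the reply text."""
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        engine=engine,    # Azure deployment name (openai<1.0 SDK); a separate model= argument is not needed
        temperature=0,    # minimize sampling randomness
        messages=messages,
    )
    # Keep only the text of the first choice, trimmed of surrounding spaces and newlines
    return response["choices"][0]["message"]["content"].strip(" \n")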
  
#Print the response from the ChatCompletion request.
#prompt = 'how many planets in solar system'
#prompt='Explain concisely in three to five sentences what embedding technique used for openai embedding model text-embedding-ada-002'
#prompt='what is AI?'
#response = generate_response(prompt)
#print(f'GPT says: {response}')  

## 1. Human Evaluation: Hints to Detect AI-generated Content
### Tone & Style
- AI output often reads as **robotic and uniform**, with few transition words and little variation in tone.  
- AI-generated content is often **monotone** and lacks a personal touch, opinion, or emotion.  
- When people write, tone and style usually vary throughout the text. Shifts in a writer's train of thought often produce a change in tone.   
- E.g. sample test prompt: `"write a paragragh about long-lasting flower species for garden"`
### Accuracy
- AI output can be incorrect, inconsistent, untrue, or out of date, because GPT's training data has a cutoff of September 2021 (so always enforce a fact-checking mechanism). E.g. asked `"who is the boss of twitter?"`, GPT gave the obsolete answer "Jack Dorsey".  
### Repetitive Language
- AI output often shows 'keyword stuffing', e.g. when GPT is asked the sample question `'what is AI?'` 
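
The tone and repetition hints above can be turned into rough numbers. The sketch below is only an illustration, not a detector: the function name `style_signals` and the choice of signals (sentence-length spread as a stand-in for varying tone, type-token ratio and top words as a stand-in for keyword stuffing) are assumptions made here, and none of them has a validated threshold.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def style_signals(text, top_n=3):
    """Return rough, hypothetical signals of uniform or repetitive writing."""
    # Split into sentences at whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]

    # Lowercased word tokens for the repetition measures
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    return {
        "sentences": len(sentences),
        "mean_sentence_len": mean(lengths) if lengths else 0.0,
        # A low spread means very uniform sentence lengths
        "sentence_len_stdev": pstdev(lengths) if lengths else 0.0,
        # A low ratio means the same words are reused a lot
        "type_token_ratio": len(counts) / len(words) if words else 0.0,
        "top_words": counts.most_common(top_n),
    }


sample = "AI is useful. AI is fast. AI is everywhere."
print(style_signals(sample, top_n=2))
```

These numbers are best read side by side for a known human text and a known AI text of similar length and topic; on their own they say little.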

In [9]:
prompt='write a paragragh about long-lasting flower species for garden'
response = generate_response(prompt)
print(f'ChatGPT says: {response}')  

ChatGPT says: When it comes to creating a beautiful and vibrant garden, choosing long-lasting flower species is essential. These flowers not only add color and fragrance to your outdoor space but also ensure that your garden remains in bloom for an extended period. Some popular long-lasting flower species for gardens include roses, marigolds, petunias, and geraniums. Roses are known for their longevity and come in a variety of colors and sizes, making them a versatile choice. Marigolds are not only visually appealing but also repel pests, making them a practical addition to any garden. Petunias are low-maintenance and can withstand various weather conditions, making them perfect for beginners. Lastly, geraniums are known for their vibrant blooms and can thrive in both sun and shade. By incorporating these long-lasting flower species into your garden, you can enjoy a beautiful and flourishing outdoor space throughout the seasons.


In [10]:
prompt='who is the boss of twitter?'
response = generate_response(prompt)
print(f'ChatGPT says: {response}')  

ChatGPT says: As of my knowledge, Jack Dorsey is the CEO and co-founder of Twitter. However, please note that executive positions can change over time, so it's always a good idea to verify the latest information.


Note: Recent ChatGPT updates include **safety verbiage** indicating that the answer is relevant as of September 2021 when asking questions about current events. Although this update is helpful, it is still possible to get output that isn’t accurate.

In [8]:
prompt='what is AI?'
response = generate_response(prompt)
print(f'ChatGPT says: {response}')  

ChatGPT says: AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It involves the development of computer systems capable of performing tasks that typically require human intelligence, such as speech recognition, decision-making, problem-solving, and language translation. AI can be categorized into two types: Narrow AI, which is designed for specific tasks, and General AI, which possesses the ability to understand, learn, and apply knowledge across various domains.


## 2. AI Content Detection Tools

### The idea: use AI to detect AI-generated content
### How does it work?
- These tools analyze text in context to estimate how likely the words are to appear together.  
- **The more predictable the linguistic pattern, the more likely it’s AI-generated content**. Compared with human writing, AI text is typically more predictable, less creative in its sentence construction, and rarely offers opinions.  
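
To make "predictability" concrete, here is a toy sketch that scores a word sequence with a bigram model using add-one smoothing. It is an assumption-heavy illustration: the commercial tools listed below do not publish their methods, real detectors would rely on large neural language models rather than bigram counts, and the helper names `train_bigram` and `perplexity` are made up for this example.

```python
import math
from collections import Counter


def train_bigram(corpus_tokens):
    """Count bigrams and their left-hand contexts in a token list."""
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    contexts = Counter(corpus_tokens[:-1])  # every token that has a successor
    vocab = set(corpus_tokens)
    return bigrams, contexts, vocab


def perplexity(tokens, model):
    """Perplexity of `tokens` under the bigram model (lower = more predictable)."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for words never seen in training
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, cur in pairs:
        # Add-one (Laplace) smoothing so unseen pairs keep a non-zero probability
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


corpus = "the cat sat on the mat the cat sat on the rug".split()
model = train_bigram(corpus)

predictable = "the cat sat on the mat".split()
surprising = "rug the on sat mat cat".split()
print(perplexity(predictable, model), perplexity(surprising, model))
```

The word order the model has seen before gets the lower score, which is the signal detection tools are described as looking for.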

### Some typical tools
- Originality AI 
- GPTZero
- Content at Scale 
- Winston AI  
- Hugging Face    
- Other 'AI content detector' tool makers:   
  - [Writer](https://writer.com/ai-content-detector/)  
  - [Copyleaks](https://copyleaks.com/ai-content-detector)   
  - [Sapling.ai](https://sapling.ai/)  
  - [Crossplag AI](https://app.crossplag.com/sign-up)  
  - [Corrector App](https://corrector.app/) 
  - [Kazan SEO](https://kazanseo.com/)  
  - And many more ...

## [Originality AI](https://originality.ai/)
- One of the leading tools, built mainly to identify content generated by **LLMs such as ChatGPT, GPT-2/-3/-4 and Bard** 
- Not free, but its Chrome extension comes with 50 credits (one credit scans 100 words)
- Content can be scanned by inputting a URL, uploading a file, or pasting text.

## [GPTZero](https://gptzero.me/)
- Another excellent tool, trained to detect ChatGPT, GPT-4, Bard, LLaMA and other LLMs  
- Free detection for up to 5k characters   

## [Content at Scale](https://contentatscale.ai/ai-content-detector/)
- It has a **FREE** tool named `AI content detector`, capable of detecting ChatGPT, GPT-4, Bard, Claude and more 
- Its content score is based on the text’s predictability, probability and pattern  
- It can be used to scan various types of text, including **SEO**, educational, marketing and academic content.
- API is available on request  
- It can scan texts of up to 25,000 characters  
- Multiple language support  
- Also has a **FREE** [AI image detector](https://contentatscale.ai/ai-image-detector/) - nice!

## [Winston AI](https://gowinston.ai/)
- Can detect content generated with ChatGPT, GPT-4, Bard and many more LLMs   
- Supports OCR (Optical Character Recognition) to scan images and handwriting  
- Offers a **free** plan (scans up to 2k words) and two paid plans

## [Hugging Face](https://huggingface.co/spaces/PirateXX/AI-Content-Detector)
- Hugging Face (HF) is a platform that helps AI researchers develop and deploy deep learning (DL) models   
- The `'AI-Content-Detector'` tool linked above is a Space hosted on HF, which is **free to use, open-sourced and has community support!**  
- It offers a user-friendly free web app, which analyzes text based on patterns and perplexity  
- Note: in NLP, `perplexity` measures how well a language model predicts a text; lower perplexity means the text is more predictable to the model
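
For reference, the standard definition of perplexity for a token sequence $w_1, \dots, w_N$ under a language model $p$ is given below. The notation is chosen here for illustration and is not taken from the tool's documentation.

$$
\mathrm{PPL}(w_1, \dots, w_N) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p\left(w_i \mid w_1, \dots, w_{i-1}\right)\right)
$$

If the model assigns every token probability 1, the perplexity is 1 (perfectly predictable); the lower the probabilities it assigns, the higher the perplexity.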

## 3. Challenge: "Reliably detecting AI-generated text is mathematically impossible."
- Refer to [Reliably detecting AI-generated text is mathematically impossible](https://www.linkedin.com/pulse/reliably-detecting-ai-generated-text-mathematically-sayeed/): <i>"The complexity and sophistication of modern AI models, coupled with their ability to mimic human writing styles, make it extremely difficult to distinguish AI-generated text from human-generated content."</i>

- <i>"Detection methods based on **statistical features or linguistic patterns** are easily overcome by more advanced AI models. Additionally, the constant evolution of AI technology and the potential for adversarial attacks further complicate the task of detection."</i>

- Moreover, <i>"OpenAI says **no AI tool is reliable enough to distinguish** if the content is AI-generated"</i>, refer to [OpenAI warns teachers that ChatGPT cannot detect AI-generated content, but offers tricks to identify copied answers](https://www.businesstoday.in/technology/news/story/openai-warns-teachers-that-chatgpt-cannot-detect-ai-generated-content-but-offers-tricks-to-identify-copied-answers-396723-2023-09-04)
    
#### Background info about 'Adversarial attack'
- <i>"Adversarial attacks involve intentionally modifying or manipulating (e.g. **re-write/paraphrase**) AI-generated text to make it more difficult to detect. By introducing subtle changes or perturbations, attackers can effectively camouflage AI-generated content, making it virtually impossible to distinguish from human-generated text."</i>

- For an actual example, refer to [New AI Detection Bypassing Tool, UndetectableAI, Hits the Market](https://www.kentuckytoday.com/news/national/new-ai-detection-bypassing-tool-undetectableai-hits-the-market/article_32480a33-d83a-5175-88cd-1a6739c4fa75.html#:~:text=Using%20UndetectableAI%20to%20bypass%20AI,bypass%20most%20AI%20writing%20checks): <i>"Using UndetectableAI to bypass AI detection is simple. To use it, its users simply need to enter their text that has been created by an AI tool and have the tool rewrite the text with a single click. They will then instantly be given the rewritten content that is able to bypass most AI writing checks."</i> (note: this website cannot be accessed)
    
## 4. Best Practice for AI Content Detection
- A hybrid human+AI approach is proposed, using both **a trained eye and an AI-powered tool** to get the best results.
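
One possible way to wire the hybrid approach into a workflow is a small triage step, sketched below. Everything in it is hypothetical: the function name `triage`, the assumption that a detection tool returns a score between 0 and 1, and the two thresholds are placeholders for illustration. Because no tool is reliable on its own, the score never makes the final decision; it only sets how much human attention a text gets.

```python
def triage(detector_score, review_threshold=0.3, priority_threshold=0.8):
    """Map a detector score in [0, 1] to a hypothetical human-review decision."""
    if not 0.0 <= detector_score <= 1.0:
        raise ValueError("detector_score must be between 0 and 1")
    if detector_score >= priority_threshold:
        # Strong signal from the tool: a trained reviewer looks at it first
        return "priority_human_review"
    if detector_score >= review_threshold:
        # Ambiguous signal: a trained reviewer decides
        return "human_review"
    # Weak signal: still sampled occasionally, since detectors can miss AI text
    return "spot_check"


for score in (0.1, 0.5, 0.9):
    print(score, triage(score))
```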